KV cache trades space for time
Without KV cache (naive path, sketched after this list): when generating token N, we need to
- recompute the Key and Value matrices for all previous tokens (1 through N-1)
- compute attention using these Keys and Values
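A minimal single-head numpy sketch of that naive path; the dimensions and projection weights W_q/W_k/W_v are made-up toy values for illustration. The point is that `generate_step_no_cache` re-projects K and V for the entire prefix on every step, so step N pays for O(N) projections:

```python
import numpy as np

d_model = 16                                        # toy hidden size (illustrative)
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d_model)             # (1, N) attention scores
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V                        # (1, d_model) context vector

def generate_step_no_cache(prefix):                 # prefix: (N, d_model) token embeddings
    K = prefix @ W_k                                # re-projected for ALL N tokens, every step
    V = prefix @ W_v
    q = prefix[-1:] @ W_q                           # query for the newest position only
    return attend(q, K, V)

prefix = np.random.randn(10, d_model)               # pretend 10 tokens already generated
out = generate_step_no_cache(prefix)                # O(N) K/V projections just for this step
```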
With KV cache (sketched after this list):
- store the Key and Value vectors for each token after computing them once
- when generating new tokens, reuse the cached K,V vectors from previous tokens
- only compute new K,V vectors for the current token being generated
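The cached counterpart, using the same kind of toy single-head setup (illustrative weights, not any particular model): each new token is projected exactly once, its K,V rows are appended to the cache, and every later step just reuses the stored rows.

```python
import numpy as np

d_model = 16                                        # toy hidden size (illustrative)
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

class KVCache:
    def __init__(self):
        self.K = np.empty((0, d_model))             # cached Key rows, one per token
        self.V = np.empty((0, d_model))             # cached Value rows, one per token

def generate_step_with_cache(new_token, cache):     # new_token: (1, d_model) embedding
    cache.K = np.vstack([cache.K, new_token @ W_k]) # project K once, append to cache
    cache.V = np.vstack([cache.V, new_token @ W_v]) # project V once, append to cache
    q = new_token @ W_q                             # only the new token needs a query
    return attend(q, cache.K, cache.V)              # reuse all previously cached K, V

cache = KVCache()
for t in np.random.randn(5, d_model):               # pretend embeddings of 5 generated tokens
    out = generate_step_with_cache(t[None, :], cache)
```

Per step, compute drops from O(N) projections to O(1), at the cost of storing 2 x N x d_model values per layer.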
Optimization techniques (rough size impact sketched after this list):
- MQA/GQA: reduce KV cache size by sharing K,V heads across query heads (MQA: one shared K,V head; GQA: one per group of query heads)
- MLA (Multi-head Latent Attention): compresses K,V into a low-rank latent representation
- Sliding window: only cache the most recent W tokens, evicting older entries
- KV cache quantization: store K,V in lower precision (e.g., INT8)
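A back-of-the-envelope comparison of how these levers shrink the cache. The model dimensions (32 layers, 32 heads of size 128, 8K context, 8 KV heads for GQA, a 4096-token window) are assumptions chosen to look typical, not the config of any specific model:

```python
# Rough KV cache size per sequence under a few of the techniques above.
n_layers, n_heads, head_dim, seq_len = 32, 32, 128, 8192   # illustrative assumptions

def kv_bytes(n_kv_heads, cached_len, bytes_per_elem):
    # 2x for K and V, per layer, per KV head, per cached position
    return 2 * n_layers * n_kv_heads * head_dim * cached_len * bytes_per_elem

variants = {
    "MHA, fp16":          kv_bytes(n_heads, seq_len, 2),   # baseline: 32 KV heads, 16-bit
    "GQA-8, fp16":        kv_bytes(8,       seq_len, 2),   # 8 KV heads shared by 32 query heads
    "MHA, INT8":          kv_bytes(n_heads, seq_len, 1),   # quantized cache
    "MHA, window=4096":   kv_bytes(n_heads, 4096,    2),   # only recent tokens cached
}

for name, b in variants.items():
    print(f"{name:18s} {b / 2**30:.2f} GiB per sequence")
```

With these numbers the fp16 MHA baseline is 4 GiB per 8K-token sequence, GQA-8 and INT8 each cut it to roughly a quarter or half respectively, and the techniques compose (e.g., GQA plus INT8 plus a window).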