Implementations
- Multi-Head Attention (MHA)
- the standard mechanism in the original Transformer
- each head uses separate Query, Key, and Value projections (see the MHA/GQA/MQA sketch after this list)
- Grouped Query Attention (GQA)
- a more efficient variant where Key and Value projections are shared across groups of attention heads, while Query projections are still separate
- Used in LLaMA-2 (70B) and LLaMA 3
- Multi-Query Attention (MQA)
- a variant where all heads share the same Key and Value projections, while Query projections are still separate
- Used in PaLM and Falcon
- Multi-Head Latent Attention (MLA)
- compresses Keys and Values into a low-rank latent space, reducing KV cache requirements; introduced in DeepSeek-V2 (see the latent-KV sketch after this list)
- Native Sparse Attention (NSA): DeepSeek's trainable sparse attention, built from 3 lenses (parallel attention branches)
- compressed lens (global view, main ideas)
- selected lens (important details)
- sliding lens (recent context)
- Flash Attention 3
- attention kernels optimized for NVIDIA Hopper (H100) GPUs, exploiting asynchrony between compute and memory movement plus FP8 support
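MHA, GQA, and MQA can be written as one module parameterized by the number of Key/Value heads: equal to the number of Query heads gives MHA, a smaller divisor gives GQA, and one gives MQA. The sketch below is a minimal, hypothetical PyTorch illustration (class and parameter names are made up, not taken from any particular model); it leans on `torch.nn.functional.scaled_dot_product_attention` for the core attention step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedAttention(nn.Module):
    """Illustrative module: num_kv_heads == num_heads -> MHA,
    1 < num_kv_heads < num_heads -> GQA, num_kv_heads == 1 -> MQA."""
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each K/V head serves a group of query heads; broadcast it across its group.
        group = self.num_heads // self.num_kv_heads
        k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))

# MHA: GroupedAttention(512, 8, 8); GQA: GroupedAttention(512, 8, 2); MQA: GroupedAttention(512, 8, 1)
```

The KV cache shrinks in proportion to num_heads / num_kv_heads, which is the whole point of GQA and MQA.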
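MLA goes further by caching a single low-rank latent per token instead of per-head Keys and Values. The sketch below is a heavily simplified, hypothetical version of that idea (it omits DeepSeek-V2's decoupled RoPE branch and other details of the real design; names such as `kv_down` and `k_up` are illustrative only).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Cache only a d_latent-dimensional vector per token; reconstruct K/V on the fly."""
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_latent: int = 64):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # decompress to Keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # decompress to Values
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        B, T, _ = x.shape
        c_new = self.kv_down(x)                                   # (B, T, d_latent)
        c_kv = c_new if latent_cache is None else torch.cat([latent_cache, c_new], dim=1)
        S = c_kv.size(1)
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(c_kv).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(c_kv).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        # Causal masking matters during prefill; during decode the query is a single new token.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(latent_cache is None))
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1)), c_kv
```

Per token, the cache holds d_latent numbers instead of 2 * num_heads * head_dim, which is where the memory savings come from.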
Properties
- Flash Attention
- not a different attention mechanism, but an IO-aware, exact implementation of standard attention
- uses tiling and recomputation to avoid materializing the full attention matrix in GPU high-bandwidth memory (HBM)
- Sparse Attention
- attends only to a subset of positions (e.g., local sliding windows, strided patterns); see the sliding-window sketch after this list
- Linear Attention
- approximates softmax attention with kernel feature maps, reducing complexity from O(n²) to O(n) in sequence length (see the sketch after this list)
- Cross-Attention
- used in encoder-decoder models: Queries come from one sequence (e.g., decoder states) while Keys and Values come from another (e.g., encoder outputs); see the example after this list
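The simplest concrete sparse pattern is a causal sliding window, sketched below with an explicit boolean mask for clarity (function name and shapes are illustrative). A real sparse kernel skips the masked blocks instead of computing and then discarding them, which is where the actual savings come from.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """q, k, v: (batch, heads, seq, head_dim). Each token attends to itself
    and at most `window - 1` preceding tokens."""
    T = q.size(-2)
    idx = torch.arange(T, device=q.device)
    # Allowed positions: j <= i (causal) and i - j < window (local band).
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = k = v = torch.randn(1, 8, 128, 64)
out = sliding_window_attention(q, k, v, window=32)  # (1, 8, 128, 64)
```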
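Linear attention replaces the softmax with a feature map phi so that phi(Q) (phi(K)^T V) can be computed in time linear in sequence length. The non-causal sketch below follows the elu(x) + 1 feature map of Katharopoulos et al. (2020); the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (batch, heads, seq, head_dim); non-causal for brevity."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0             # positive feature map phi
    kv = torch.einsum("bhnd,bhne->bhde", k, v)        # sum_n phi(k_n) v_n^T, computed once
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

out = linear_attention(torch.randn(1, 8, 128, 64),
                       torch.randn(1, 8, 128, 64),
                       torch.randn(1, 8, 128, 64))    # (1, 8, 128, 64)
```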
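Cross-attention needs no new machinery: it is ordinary attention where the Query sequence and the Key/Value sequence differ, as in this small example using PyTorch's built-in `nn.MultiheadAttention` (all shapes are made up for illustration).

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

decoder_states  = torch.randn(2, 10, d_model)   # target sequence (Queries)
encoder_outputs = torch.randn(2, 37, d_model)   # source sequence (Keys/Values)
out, _ = cross_attn(query=decoder_states, key=encoder_outputs, value=encoder_outputs)
print(out.shape)  # torch.Size([2, 10, 512])
```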
Landscape of Attention Mechanisms (August 2025)
- Memory Efficiency is Paramount: The shift from MHA to GQA/MLA is driven by the need to reduce KV cache size for longer contexts
- Hardware-Software Co-design: Flash Attention 3’s success shows the importance of optimizing for specific hardware (H100 GPUs)
- Low-rank Approximations: MLA’s success demonstrates that intelligent compression can maintain quality while drastically reducing memory usage
- Sparse Attention Gaining Ground: DeepSeek's NSA shows trainable, hardware-aligned sparse attention becoming practical at scale
- Context Length Arms Race: Models now support 100K+ tokens routinely, with some reaching 1M+ tokens