Implementations
- Multi-Head Attention (MHA)
- the standard mechanism in the original Transformer
- each head uses separate Query, Key, and Value projections (see the MHA/GQA/MQA sketch after this list)
- Grouped Query Attention (GQA)
- a more efficient variant where Key and Value projections are shared across groups of attention heads, while Query projections are still separate
- Used in LLaMA-2 (70B) and LLaMA 3
- Multi-Query Attention (MQA)
- a variant where all heads share the same Key and Value projections, while Query projections are still separate
- Used in PaLM and Falcon
- Multi-Head Latent Attention (MLA)
- compresses Keys and Values into a low-rank latent space, reducing KV cache requirements; introduced in DeepSeek-V2 (see the latent-KV sketch after this list)
- Native Sparse Attention (NSA): DeepSeek's trainable sparse attention, built from 3 lenses (parallel attention branches)
- compressed lens (global view, main ideas)
- selected lens (important details)
- sliding lens (recent context)
- Flash Attention 3
- attention kernels optimized for NVIDIA Hopper (H100) GPUs, exploiting asynchrony between compute and memory movement plus FP8 support
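MHA, GQA, and MQA can be written as one module parameterized by the number of Key/Value heads: equal to the number of Query heads gives MHA, a smaller divisor gives GQA, and one gives MQA. The sketch below is a minimal, hypothetical PyTorch illustration (class and parameter names are made up, not taken from any particular model); it leans on `torch.nn.functional.scaled_dot_product_attention` for the core attention step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedAttention(nn.Module):
    """Illustrative module: num_kv_heads == num_heads -> MHA,
    1 < num_kv_heads < num_heads -> GQA, num_kv_heads == 1 -> MQA."""
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each K/V head serves a group of query heads; broadcast it across its group.
        group = self.num_heads // self.num_kv_heads
        k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))

# MHA: GroupedAttention(512, 8, 8); GQA: GroupedAttention(512, 8, 2); MQA: GroupedAttention(512, 8, 1)
```

The KV cache shrinks in proportion to num_heads / num_kv_heads, which is the whole point of GQA and MQA.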
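MLA goes further by caching a single low-rank latent per token instead of per-head Keys and Values. The sketch below is a heavily simplified, hypothetical version of that idea (it omits DeepSeek-V2's decoupled RoPE branch and other details of the real design; names such as `kv_down` and `k_up` are illustrative only).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Cache only a d_latent-dimensional vector per token; reconstruct K/V on the fly."""
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_latent: int = 64):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # decompress to Keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # decompress to Values
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        B, T, _ = x.shape
        c_new = self.kv_down(x)                                   # (B, T, d_latent)
        c_kv = c_new if latent_cache is None else torch.cat([latent_cache, c_new], dim=1)
        S = c_kv.size(1)
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(c_kv).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(c_kv).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        # Causal masking matters during prefill; during decode the query is a single new token.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(latent_cache is None))
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1)), c_kv
```

Per token, the cache holds d_latent numbers instead of 2 * num_heads * head_dim, which is where the memory savings come from.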
Properties
- Flash Attention
- not a different attention mechanism, but an IO-aware, exact implementation of standard attention
- uses tiling and recomputation to avoid materializing the full attention matrix in GPU high-bandwidth memory (HBM)
- Sparse Attention
- attends only to a subset of positions (e.g., local sliding windows, strided patterns); see the sliding-window sketch after this list
- Linear Attention
- approximates softmax attention with kernel feature maps, reducing complexity from O(n²) to O(n) in sequence length (see the sketch after this list)
- Cross-Attention
- used in encoder-decoder models: Queries come from one sequence (e.g., decoder states) while Keys and Values come from another (e.g., encoder outputs); see the example after this list
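The simplest concrete sparse pattern is a causal sliding window, sketched below with an explicit boolean mask for clarity (function name and shapes are illustrative). A real sparse kernel skips the masked blocks instead of computing and then discarding them, which is where the actual savings come from.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """q, k, v: (batch, heads, seq, head_dim). Each token attends to itself
    and at most `window - 1` preceding tokens."""
    T = q.size(-2)
    idx = torch.arange(T, device=q.device)
    # Allowed positions: j <= i (causal) and i - j < window (local band).
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = k = v = torch.randn(1, 8, 128, 64)
out = sliding_window_attention(q, k, v, window=32)  # (1, 8, 128, 64)
```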
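Linear attention replaces the softmax with a feature map phi so that phi(Q) (phi(K)^T V) can be computed in time linear in sequence length. The non-causal sketch below follows the elu(x) + 1 feature map of Katharopoulos et al. (2020); the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (batch, heads, seq, head_dim); non-causal for brevity."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0             # positive feature map phi
    kv = torch.einsum("bhnd,bhne->bhde", k, v)        # sum_n phi(k_n) v_n^T, computed once
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

out = linear_attention(torch.randn(1, 8, 128, 64),
                       torch.randn(1, 8, 128, 64),
                       torch.randn(1, 8, 128, 64))    # (1, 8, 128, 64)
```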
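Cross-attention needs no new machinery: it is ordinary attention where the Query sequence and the Key/Value sequence differ, as in this small example using PyTorch's built-in `nn.MultiheadAttention` (all shapes are made up for illustration).

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

decoder_states  = torch.randn(2, 10, d_model)   # target sequence (Queries)
encoder_outputs = torch.randn(2, 37, d_model)   # source sequence (Keys/Values)
out, _ = cross_attn(query=decoder_states, key=encoder_outputs, value=encoder_outputs)
print(out.shape)  # torch.Size([2, 10, 512])
```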
Landscape of Attention Mechanisms (August 2025)
- Memory Efficiency is Paramount: The shift from MHA to GQA/MLA is driven by the need to reduce KV cache size for longer contexts
- Hardware-Software Co-design: Flash Attention 3’s success shows the importance of optimizing for specific hardware (H100 GPUs)
- Low-rank Approximations: MLA’s success demonstrates that intelligent compression can maintain quality while drastically reducing memory usage
- Sparse Attention Gaining Ground: DeepSeek's NSA shows trainable, hardware-aligned sparse attention becoming practical at scale
- Context Length Arms Race: Models now support 100K+ tokens routinely, with some reaching 1M+ tokens