where $W_Q, W_K, W_V$ are learnable parameters, called projection matrices
$W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$: what to look for
$W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$: what to compare with
$W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$: what information to extract
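As a concrete sketch (NumPy, with toy sizes assumed purely for illustration), the projections are just three matrix multiplications applied to the token embeddings $X$:

```python
import numpy as np

# Toy sizes, assumed for illustration only
n, d_model, d_k, d_v = 4, 8, 2, 2

rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))      # context-free embeddings, one row per token

W_Q = rng.normal(size=(d_model, d_k))  # what to look for
W_K = rng.normal(size=(d_model, d_k))  # what to compare with
W_V = rng.normal(size=(d_model, d_v))  # what information to extract

Q = X @ W_Q                            # (n, d_k)
K = X @ W_K                            # (n, d_k)
V = X @ W_V                            # (n, d_v)
```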
Attention:
score: $S = QK^T$
scaled: $S_{\text{scaled}} = \frac{S}{\sqrt{d_k}}$
softmax: $A = \text{softmax}(S_{\text{scaled}})$
Final output: $\text{Attention}(Q, K, V) = AV$
the whole process lifts the context-free embedding into a contextualized representation
$d_k$ and $d_v$ are hyperparameters
often $d_k = d_v = d_{\text{model}} / h$
$h$: number of attention heads
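A minimal NumPy sketch of the attention steps above (single head, no batching; `Q`, `K`, `V` come from the projection sketch earlier, and the `softmax` helper is written out for self-containment):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    S = Q @ K.T                   # score: (n, n) matrix of dot products
    S_scaled = S / np.sqrt(d_k)   # scale by sqrt(d_k)
    A = softmax(S_scaled)         # each row sums to 1
    return A @ V                  # weighted sum of values, shape (n, d_v)

out = attention(Q, K, V)          # contextualized representations
```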
Causal Masking:
$\text{mask} = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}$
$A = \text{softmax}(S_{\text{scaled}} + \text{mask})$
In decoder self-attention, we're generating from left to right. During training we have the full target sequence available, but we don't want any position to attend to future positions, because those tokens won't be available at inference time.
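Continuing the same sketch, the mask is an upper-triangular matrix of $-\infty$ added to the scaled scores before the softmax (reuses the `softmax` helper above; single head assumed):

```python
import numpy as np

def causal_mask(n):
    # -inf strictly above the diagonal, 0 on and below it
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    S_scaled = (Q @ K.T) / np.sqrt(d_k)
    A = softmax(S_scaled + causal_mask(Q.shape[0]))  # future positions get weight 0
    return A @ V
```

After the softmax, every masked entry becomes exactly 0, so position $i$ only mixes information from positions $\le i$.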
Scalability:
Time complexity: $O(n^2 d_{\text{model}})$
Space complexity: $O(n^2)$
for storing the attention matrix ($S_{\text{scaled}}$)
Quadratic scaling in context length is the reason why transformers struggle with long contexts.
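A rough illustration of that quadratic memory cost: the attention matrix stores one score per token pair, so at float32 precision (assumed here, single head, no batching) it grows as follows:

```python
# Memory for a single (n, n) float32 attention matrix
for n in (1_024, 8_192, 65_536):
    mib = n * n * 4 / 2**20      # 4 bytes per float32 score
    print(f"n = {n:>6}: {mib:>10,.0f} MiB")
```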