self

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
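
To make the formula above concrete, here is a minimal sketch in plain Python. It assumes Q, K, and V are given as nested lists of floats with shapes n x d_k, m x d_k, and m x d_v; the function names `softmax` and `attention` are illustrative choices, not part of the original notes. It omits masking, batching, and multiple heads, and a real implementation would normally use an array library rather than Python lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Assumed shapes: Q is n x d_k, K is m x d_k, V is m x d_v.
    Returns an n x d_v nested list.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    out = []
    for q in Q:
        # One row of QK^T / sqrt(d_k): this query scored against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output row is the weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out


# Toy input: a zero query scores every key equally, so the output
# is the plain average of the value rows.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # [[2.0, 3.0]]
```

The division by \sqrt{d_k} is applied to the scores before the softmax, matching the order of operations in the formula.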

Causal Masking:

Scalability:

Multi-head:

Parameter calculation: