# Attention Wasn't All We Needed

There are a lot of modern techniques that have been developed since the original Attention Is All You Need paper. Let's look at some of the most important ones that have been developed over the years and try to implement the basic ideas as succinctly as possible. We'll use the PyTorch framework for most of the examples. Note that most of these examples are highly simplified sketches of the core ideas; if you want the full implementation, please read the original paper or the production code in frameworks like PyTorch or JAX.

- Group Query Attention
- Multi-head Latent Attention
- Flash Attention
- Ring Attention
- Pre-normalization
- RMSNorm
- SwiGLU
- Rotary Positional Embedding
- Mixture of Experts
- Learning Rate Warmup
- Cosine Schedule
- AdamW Optimizer
- Multi-token Prediction
- Speculative Decoding

## Group Query Attention

Ok, starting off in no particular order, Grouped Query Attention (GQA) is an architectural optimization of the standard multi-head attention (MHA) mechanism that reduces the memory usage of the KV cache during inference. The core idea behind GQA is the observation that the computational bottleneck and memory footprint in MHA are heavily influenced by the size of the K and V projections and their corresponding caches. GQA proposes to reduce this cost by sharing a single set of K and V projections across multiple Q heads.

Instead of having \(N_h\) distinct heads for Q, K, and V (as in MHA), GQA uses \(N_h\) query heads but only \(N_{kv}\) key/value heads, where \(N_{kv} < N_h\) and \(N_h\) is typically a multiple of \(N_{kv}\). These \(N_h\) query heads are divided into \(N_{kv}\) groups, with each group of \(N_h / N_{kv}\) query heads attending to the same key and value head. This structure significantly reduces the parameter count for the K and V projection matrices and, more importantly, shrinks the size of the K/V cache needed during autoregressive decoding. Let the input sequence representation be \(X \in \mathbb...
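To make the grouping concrete, here is a minimal sketch of GQA in PyTorch, assuming self-attention with a causal mask and no KV cache. The class name `GroupedQueryAttention` and the parameters `d_model`, `n_heads`, and `n_kv_heads` are illustrative choices, not taken from any particular production implementation; the essential point is that K and V are projected to only `n_kv_heads` heads and then repeated to cover each group of query heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Simplified GQA sketch: n_heads query heads share n_kv_heads K/V heads."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = d_model // n_heads
        # Full-sized query projection, but smaller K/V projections:
        # only n_kv_heads sets of key/value parameters are learned.
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        # Project and reshape to (batch, heads, seq_len, head_dim).
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each K/V head serves a group of n_heads / n_kv_heads query heads,
        # so repeat K and V along the head dimension to line up with the queries.
        group_size = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        # Standard scaled dot-product attention with a causal mask.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(B, T, -1)
        return self.o_proj(out)

# Example: 8 query heads sharing 2 K/V heads, i.e. groups of 4.
x = torch.randn(1, 16, 512)
gqa = GroupedQueryAttention(d_model=512, n_heads=8, n_kv_heads=2)
print(gqa(x).shape)  # torch.Size([1, 16, 512])
```

In a real decoder only the `n_kv_heads` K/V tensors would be stored in the cache, which is where the memory saving over MHA comes from; the repeat is just a view-level broadcast at attention time.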