Writing an LLM from scratch, part 13 – attention heads are dumb

Now that I've finished chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" -- having worked my way through multi-head attention in the last post -- I thought it would be worth pausing to take stock before moving on to chapter 4. There are two things I want to cover: the "why" of self-attention, and some thoughts on context lengths.

This post is on the "why" -- that is, why do the particular set of matrix multiplications described in the book do what we want them to do? As always, this is something I'm doing primarily to get things clear in my own head, with the possible extra benefit of it being of use to other people out there. I will, of course, run it past multiple LLMs to make sure I'm not posting total nonsense, but caveat lector!

Let's get into it. As I wrote in part 8 of this series:

    I think it's also worth noting that [what's in the book is] very much a "mechanistic" explanation -- it says how we do these calculations without saying why. I think that the "why" is actually out of scope for this book, but it's something that fascinates me, and I'll blog about it soon.

That "soon" is now :-)

Attention heads are dumb

I think that my core problem with getting my head around why these equations work was that I was overestimating what a single attention head could do. In part 6, I wrote, of the phrase "the fat cat sat on the mat":

    So while the input embedding for "cat" just means "cat in position 3", the context vector for "cat" in this sentence also has some kind of overtones about it being a cat that is sitting, perhaps less strongly that it's a specific cat ("the" rather than "a"), and hints of it being sitting on a mat.

The thing that I hadn't understood was that this is true as far as it goes, but only for the output of the attention mechanism as a whole -- not for a single attention head. Each individual attention head is really dumb, and what it's doing is much simpler than that! The two thing...
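To make "the particular set of matrix multiplications" concrete, here's a minimal sketch of a single attention head in PyTorch, the framework the book uses. The dimensions and the random stand-in embeddings are my own illustrative choices, not the book's code; the point is just that each row of the output is a context vector, a weighted blend of every token's value vector.

```python
import torch

torch.manual_seed(0)

seq_len, d_in, d_out = 7, 8, 4   # "the fat cat sat on the mat" = 7 tokens; sizes are arbitrary

x = torch.randn(seq_len, d_in)   # stand-in input embeddings, one row per token

W_q = torch.randn(d_in, d_out)   # query projection (trainable in a real model)
W_k = torch.randn(d_in, d_out)   # key projection
W_v = torch.randn(d_in, d_out)   # value projection

Q = x @ W_q                      # queries: (seq_len, d_out)
K = x @ W_k                      # keys
V = x @ W_v                      # values

scores = Q @ K.T                                       # how strongly each token attends to each other token
weights = torch.softmax(scores / d_out**0.5, dim=-1)   # scaled; each row sums to 1
context = weights @ V                                  # one context vector per token

print(weights[2].round(decimals=2))   # attention weights for "cat" (position 3, index 2)
print(context.shape)                  # torch.Size([7, 4])
```

Note how little machinery there is: three projections, one dot product, a softmax, and a weighted sum. That simplicity is exactly the point of the "attention heads are dumb" framing above.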
