A 20-Year-Old Algorithm Can Help Us Understand Transformer Embeddings

https://news.ycombinator.com/rss Hits: 10
Summary

Suppose we ask an LLM: “Can you tell me about Java?” What “Java” is the model thinking about? The programming language or the Indonesian island? To answer this question, we can try to understand what is going on inside the model. Specifically, we want to represent the model’s internal states in a human-interpretable way by finding the concepts that the model is thinking about. One approach is to phrase this as a dictionary learning problem, in which we try to decompose complex embeddings into a sum of simple and interpretable concept vectors selected from a learned dictionary. It is not obvious that embeddings can be broken down as a linear sum of interpretable elements. However, in 2022, Elhage et al. introduced supporting evidence for the “superposition hypothesis,” which suggests that this superposition of monosemantic concept vectors is a good model for the complex embeddings.

Finding these concept vectors remains an ongoing challenge. When dictionary learning was first proposed for this problem by Bricken et al. in 2023, they used a single-layer neural network called a sparse autoencoder (SAE) to learn the dictionary, an approach that has since become widely popular. The problem of dictionary learning actually goes way back (pre-2000s!), but Bricken et al. opted to forgo established algorithms in favor of SAEs for two main reasons: “First, a sparse autoencoder can readily scale to very large datasets, which we believe is necessary to characterize the features present in a model trained on a large and diverse corpus. […] Secondly, we have a concern that iterative dictionary learning methods might be “too strong”, in the sense of being able to recover features from the activations which the model itself cannot access. Exact compressed sensing is NP-hard, which the neural network certainly isn’t doing.”

In our recent paper, we show that with minor modifications, traditional methods can be scaled to sufficiently large datasets with millions of...
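To make the SAE idea concrete, here is a minimal sketch (not the authors' implementation) of a single-hidden-layer sparse autoencoder in the spirit of Bricken et al.: the decoder columns act as the learned dictionary of concept vectors, the ReLU-gated encoder outputs are the sparse concept activations, and an L1 penalty keeps those activations sparse so each embedding is reconstructed as a sum of a few dictionary vectors. The dimensions, L1 coefficient, and random stand-in activations below are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # embedding -> sparse concept activations
        self.decoder = nn.Linear(d_dict, d_model)  # decoder weight columns = dictionary (concept) vectors

    def forward(self, x):
        codes = torch.relu(self.encoder(x))        # sparse, non-negative codes
        recon = self.decoder(codes)                # embedding rebuilt as a weighted sum of concept vectors
        return recon, codes

# Toy training loop on random data, just to show the loss structure:
# reconstruction error plus an L1 sparsity penalty on the codes.
d_model, d_dict, l1_coeff = 512, 4096, 1e-3       # illustrative sizes, not values from the paper
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(256, d_model)                  # stand-in for model activations
    recon, codes = sae(x)
    loss = ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, each embedding x is approximated by decoder(codes), i.e. a sparse linear combination of dictionary vectors, which is exactly the superposition-style decomposition described above.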

First seen: 2025-08-31 15:44

Last seen: 2025-09-01 00:46