The Tradeoffs of SSMs and Transformers

This blog post was adapted from a talk I've given a handful of times over the last year. It was meant to be a high-level talk accessible to a fairly broad audience, but hopefully it has some interesting insights, opinions, and intuitions around sequence models for the dedicated researchers too.

State Space Models

Just so we're on the same page, I'll start by defining what I mean by a state space model. (This section isn't strictly necessary to get to the main part of this post though; feel free to skip directly to the next section.)

\[\begin{equation} \label{eq:ssm}
\begin{aligned}
h_{t} &= A_t h_{t-1} + B_t x_t \\
y_t &= C_t^{\top} h_t
\end{aligned}
\end{equation}\]

These equations define the (structured) state space model (SSM) as developed in a line of work culminating in Mamba. They can be viewed as a modern version of a recurrent neural network (RNN) with a few key characteristics. While a lot of technical work was involved in getting this family of models to work, I'll start by trying to abstract away what I view as the main high-level ingredients that made these models successful, e.g. able to match the performance of Transformers on language modeling.

The three ingredients

1. State size

A characteristic of the SSM is that its hidden state $h_t$ has a larger size than the inputs and outputs $x_t, y_t$. The key idea is that the hidden state of any recurrent model is its only access to the model's context (in an autoregressive setting). So for modeling information-dense modalities such as language, the model needs a large enough state to store the relevant information that it wants to access later. In SSMs, if each input $x_t$ is a 1-dimensional scalar, then the hidden state $h_t$ is an $\mathtt{N}$-dimensional vector, where $\mathtt{N}$ is an independent hyperparameter called the state size, state dimension, or state expansion factor. This is also known as a SISO (single-input single-output) SSM and allows the models to store $\mathtt{N}$ times as much information a...
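To make the recurrence and the role of the state size $\mathtt{N}$ concrete, here is a minimal sketch of the SISO recurrence above in NumPy. The function name, shapes, and random parameters are my own illustration (not from any SSM library): each scalar input $x_t$ is folded into an $\mathtt{N}$-dimensional hidden state, which is the model's only memory of the past.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Sketch of the SSM recurrence h_t = A_t h_{t-1} + B_t x_t, y_t = C_t^T h_t.

    A: (T, N, N) transition matrices (often diagonal in structured SSMs)
    B: (T, N)    input projections
    C: (T, N)    output projections
    x: (T,)      scalar input sequence (the SISO case)
    Returns y: (T,) scalar outputs.
    """
    T, N = B.shape
    h = np.zeros(N)                      # hidden state: N-dimensional even though x_t is scalar
    y = np.empty(T)
    for t in range(T):
        h = A[t] @ h + B[t] * x[t]       # state update
        y[t] = C[t] @ h                  # readout
    return y

# Illustrative usage: a length-16 scalar sequence with state expansion factor N = 8
T, N = 16, 8
rng = np.random.default_rng(0)
A = np.stack([np.diag(rng.uniform(0.5, 0.99, N)) for _ in range(T)])  # stable diagonal dynamics
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
x = rng.normal(size=T)
print(ssm_scan(A, B, C, x).shape)  # (16,)
```

The time-varying (selective) matrices $A_t, B_t, C_t$ are what distinguish this from a plain linear RNN; in practice they are computed from the input rather than drawn at random as in this toy example.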
