Today’s leading language models contain upwards of a trillion parameters, pretrained on tens of trillions of tokens. Base model performance keeps improving with scale, as these trillions are necessary for learning and representing all the patterns in written-down human knowledge. In contrast, post-training involves smaller datasets and generally focuses on narrower domains of knowledge and ranges of behavior. It seems wasteful to use a terabit of weights to represent updates from a gigabit or megabit of training data.

This intuition has motivated parameter-efficient fine-tuning (PEFT), which adjusts a large network by updating a much smaller set of parameters. The leading PEFT method is low-rank adaptation, or LoRA. LoRA replaces each weight matrix $W$ from the original model with a modified version $W' = W + \gamma BA$, where $B$ and $A$ are matrices that together have far fewer parameters than $W$, and $\gamma$ is a constant scaling factor. In effect, LoRA creates a low-dimensional representation of the updates imparted by fine-tuning.

LoRA may offer advantages in the cost and speed of post-training, and there are also a few operational reasons to prefer it to full fine-tuning (henceforth, FullFT):

- **Multi-tenant serving.** Since LoRA trains an adapter (i.e., the $A$ and $B$ matrices) while keeping the original weights unchanged, a single inference server can keep many adapters (different model versions) in memory and sample from them simultaneously in a batched way (see Punica: Multi-Tenant LoRA Serving; Chen, Ye, et al., 2023). Modern inference engines such as vLLM and SGLang implement this feature; a minimal sketch of the pattern appears after this list.
- **Layout size for training.** When fine-tuning the whole model, the optimizer state needs to be stored along with the original weights, often at higher precision. As a result, FullFT usually requires an order of magnitude more accelerators than sampling from the same model does, and thus a different layout. (For training, besides storing the weights, we typically need to store gradients and optimizer state.)
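To make the parametrization concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer. The class name `LoRALinear`, the `rank` and `alpha` hyperparameters, and the convention $\gamma = \alpha / r$ are illustrative assumptions, not details specified above; only the form $W' = W + \gamma BA$ comes from the text.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = x W^T (+ bias) + gamma * x A^T B^T, i.e. W' = W + gamma * B A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Freeze the original weights; only A and B receive gradients.
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # B starts at zero so the adapter is a no-op before training (W' = W).
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.gamma = alpha / rank  # constant scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction gamma * (B A) x.
        return self.base(x) + self.gamma * (x @ self.A.T @ self.B.T)


# Example: wrap a projection and count trainable parameters.
layer = LoRALinear(nn.Linear(4096, 4096), rank=16)
y = layer(torch.randn(2, 4096))
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, n_trainable)  # torch.Size([2, 4096]) 131072, vs. ~16.8M in W
```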
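And here is a rough sketch, under assumed shapes and names, of the multi-tenant serving idea: one shared base weight, several adapters resident in memory, and each request in a batch routed through its own adapter in a single batched pass. The optimized gather-matmul kernels in Punica, vLLM, and SGLang do this far more efficiently; this only illustrates the structure.

```python
import torch

d_in, d_out, rank, n_adapters = 4096, 4096, 16, 8
W = torch.randn(d_out, d_in)                     # shared frozen base weight
A = torch.randn(n_adapters, rank, d_in) * 0.01   # stacked per-adapter A matrices
B = torch.randn(n_adapters, d_out, rank) * 0.01  # stacked per-adapter B matrices
gamma = 2.0

x = torch.randn(5, d_in)                         # one token per request in the batch
adapter_ids = torch.tensor([0, 3, 3, 1, 7])      # each request selects its own adapter

base = x @ W.T                                             # shared computation
Ax = torch.einsum('bri,bi->br', A[adapter_ids], x)         # per-request A x
delta = torch.einsum('bor,br->bo', B[adapter_ids], Ax)     # per-request B (A x)
y = base + gamma * delta                                   # W x + gamma * B A x
```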