Prompting by Activation Maximization

Summary

Prompt synthesis with activation maximization achieves 95.9% on Yelp Review Polarity (a sentiment classification task) with Llama-3.2-1B-Instruct and a 4-token prompt, vs. 57% with a hand-written prompt. Code is on GitHub.

Motivation

I'm learning PyTorch, and I'm really into activation maximization. I see a lot of potential beyond tricking classifiers and mechanistic interpretation. I've got a project in mind, and I'm building up to it. This experiment is a step on that journey.

Activation Maximization?

The model is a function with adjustable weights (coefficients). We arrange a dataset of inputs paired with ideal outputs, and adjust the model's weights to minimize the difference between the function's actual and expected output. Even with a handful of coefficients the search space is tremendous, but backpropagation and gradient descent help find minima representing useful and productive functions.

Activation maximization inverts this. Given a trained model and a target output, you adjust the input instead of the weights. The result is an input that provokes the desired output from the model.

In PyTorch, this is easy. You initialize some subject, name it as the optimizer's target, freeze the model weights, and reorganize the training loop. Loss converges, like any other training session.

Basic Experiment

For my first experiment, I used the MNIST dataset, a labeled set of 60,000 handwritten digits. (You can find MNIST on Kaggle.) On it, I trained a basic stack: three convolutional layers feeding into a two-layer perceptron. This scored 99% accuracy on the 10,000-sample benchmark. Then, with the trained model in hand, I initialized an image with random noise and named it as the optimizer's target.
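A minimal sketch of that inverted training loop, using a hypothetical stand-in classifier and MNIST-shaped inputs rather than the post's actual model: freeze the weights, make a noise image the optimizer's target, and descend on the loss toward a chosen class.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in classifier (hypothetical); in practice you'd load your trained model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
for p in model.parameters():
    p.requires_grad_(False)  # freeze the model: only the input will change
model.eval()

# The input image, not the weights, is what the optimizer adjusts.
image = torch.randn(1, 1, 28, 28, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.1)

target = torch.tensor([3])  # the class we want the model to "see"
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(image), target)  # how far the output is from the target
    loss.backward()                       # gradients flow into the image
    optimizer.step()                      # nudge the image, weights stay fixed
```

After the loss converges, `model(image).argmax()` lands on the target class: the optimized noise is an input that provokes the desired output.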
import torch
from torch import nn

device = 'cuda'

from model import Classifier

filename = 'classifier.202508042316.pt'
model = Classifier()
model.load_state_dict(
    torch.load(
        filename,
        weights_on...
