Hacking Diffusion into Qwen3 for the Arc Challenge

Overview

I've been playing with the ARC AGI Prize recently [1]. A few things surprised me while replicating the approach of last year's grand prize winner, "the ARChitects" [2]. One was that when the model was less sure about a pixel, the solution was more likely to be wrong. Another was that a per-pixel token encoding worked at all. It was as if the poor model were forced to type its solutions out, character by character, on a typewriter with no backspace. Imagine solving a jigsaw puzzle where you have to place pieces starting from the top-left corner, in order. No jumping around. No doing the edges first. That's what we're making these models do.

Instead of forcing the model to work in typewriter order, what if we let it fill in the easy parts first? Perhaps the model's uncertainty can tell us which pixels are obvious and which are tricky. As it turns out, this works:

Video 1: A comparison of the autoregressive (left) and diffusion (right) approaches on the same task. The diffusion model fills in "easy" tokens first and works its way inward to the trickier ones.

Building on recent work on converting autoregressive LLMs into diffusion models [3], I took my autoregressive model and hacked it to be able to decode in any order, then had it unmask the tokens it was most confident about first. You can see more animated detail below in How Generation Works.

Skipping ahead a bit to the results: it still needs more work. My diffusion approach is faster at 10 timesteps and achieves modestly better token accuracy, but this doesn't translate into solving more tasks. At 30 timesteps, where it finally matches the baseline's task success rate, it's actually slower than autoregressive decoding. Why? One thing the typewriter approach has going for it is that the constraint of only ever moving forward makes it easy to cache. My implementation can't do this.
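The confidence-first unmasking loop can be sketched in a few lines. This is a minimal illustration, not the post's actual code: assume the model returns logits for every position, take the top softmax probability at each still-masked position as its confidence, and commit the k most confident predictions per timestep. All names here (`unmask_step`, `MASK_ID`, `k`) are illustrative.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the [MASK] token


def unmask_step(tokens, logits, is_masked, k):
    """Reveal the k masked positions the model is most confident about.

    tokens:    (seq_len,) int array, MASK_ID at hidden positions
    logits:    (seq_len, vocab) model outputs for every position
    is_masked: (seq_len,) bool array, True where still hidden
    """
    # Numerically stable softmax over the vocab axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    prediction = probs.argmax(axis=-1)   # most likely token per position
    confidence = probs.max(axis=-1)      # its probability = "confidence"
    confidence = np.where(is_masked, confidence, -1.0)  # skip revealed spots

    k = min(k, int(is_masked.sum()))
    idx = np.argsort(-confidence)[:k]    # k most confident masked positions

    tokens, is_masked = tokens.copy(), is_masked.copy()
    tokens[idx] = prediction[idx]        # commit those predictions
    is_masked[idx] = False
    return tokens, is_masked
```

Run in a loop until `is_masked` is all False, re-querying the model each timestep, this fills in "easy" tokens early and leaves the tricky ones for later steps, when more context has been revealed. Note that because any position can change the context for any other, each step needs a full forward pass over the whole sequence, which is exactly why the autoregressive KV cache no longer applies.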
I've converted the "decoder" style LLM into a single fully connected "encoder" style, where activations of previous tokens are allowed to change based on later t...
