LLM Alloying Improves Performance over Single Model

https://news.ycombinator.com/rss Hits: 14
Summary

This spring, we had a simple and, to my knowledge, novel idea that turned out to dramatically boost the performance of our vulnerability detection agents at XBOW. On fixed benchmarks and with a constrained number of iterations, we saw success rates rise from 25% to 40%, and then soon after to 55%. The principles behind this idea are not limited to cybersecurity. They apply to a large class of agentic AI setups. Let me share.

XBOW’s Challenge

XBOW is an autonomous pentester. You point it at your website, and it tries to hack it. If it finds a way in (something XBOW is rather good at), it reports back so you can fix the vulnerability. It’s autonomous, which means: once you’ve done your initial setup, no further human handholding is allowed.

There is quite a bit to do and organize when pentesting an asset. You need to run discovery and create a mental model of the website, its tech stack, logic, and attack surface, then keep updating that mental model, building up leads and discarding them by systematically probing every part of it in different ways. That’s an interesting challenge, but not what this blog post is about.

I want to talk about one particular, fungible subtask that comes up hundreds of times in each test, and for which we’ve built a dedicated subagent: you’re pointed at a part of the attack surface knowing the genre of bug you’re supposed to be looking for, and you’re supposed to demonstrate the vulnerability. It’s a bit like competing in a CTF challenge: try to find the flag you can only get by exploiting a vulnerability that’s placed at a certain location.

In fact, we built a benchmark set of such tasks, and packaged them in a CTF-like style so we could easily repeat, scale, and assess our “solver agent’s” performance on it. The original set has, sadly, mostly outlived its usefulness because our solver agent is just too good on it by now, but we harvested more challenging examples from open source projects we ran on.
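To make the benchmark idea concrete, here is a minimal sketch of how a success rate under a fixed iteration budget could be measured on CTF-style tasks. It is not XBOW's harness: the names `Challenge`, `success_rate`, and `make_toy_solver` are hypothetical, and the toy solver is a deterministic stand-in for a real agent, assuming only that a task counts as solved when the agent returns the exact flag.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Challenge:
    """Hypothetical CTF-style task: the flag is only reachable via the planted bug."""
    name: str
    flag: str


# Assumed solver interface: (challenge, iteration budget) -> flag found, or None.
Solver = Callable[[Challenge, int], Optional[str]]


def success_rate(solver: Solver, challenges: List[Challenge],
                 max_iterations: int, repeats: int = 1) -> float:
    """Fraction of attempts in which the solver returns the correct flag."""
    attempts = 0
    solved = 0
    for challenge in challenges:
        for _ in range(repeats):
            attempts += 1
            if solver(challenge, max_iterations) == challenge.flag:
                solved += 1
    # An empty benchmark has no attempts; report 0.0 instead of dividing by zero.
    return solved / attempts if attempts else 0.0


def make_toy_solver(steps_needed: Dict[str, int]) -> Solver:
    """Stand-in for a real agent: solves a task only if the budget covers it."""
    def solve(challenge: Challenge, max_iterations: int) -> Optional[str]:
        if steps_needed[challenge.name] <= max_iterations:
            return challenge.flag
        return None
    return solve


if __name__ == "__main__":
    tasks = [Challenge(n, "FLAG{" + n + "}") for n in ("a", "b", "c", "d")]
    solver = make_toy_solver({"a": 5, "b": 20, "c": 80, "d": 200})
    for budget in (10, 100):
        print(budget, success_rate(solver, tasks, max_iterations=budget))
```

Keeping the task set and the iteration budget fixed is what makes success rates comparable from one agent variant to the next.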
The Agent’s Task

On such a CTF-like c...

First seen: 2025-07-21 01:34

Last seen: 2025-07-21 14:37