Refact.ai is the #1 open-source AI Agent on SWE-bench Verified with a 69.8% score

May 15, 2025 by Sergey Vakhreev

Refact.ai Agent achieved 69.8% on SWE-bench Verified, autonomously solving 349 out of 500 tasks. This makes Refact.ai a leading open-source AI programming Agent on SWE-bench and places it among the top ranks on the leaderboard.

SWE-bench Verified is a refined version of the original SWE-bench, featuring 500 manually selected real-world GitHub issues. It provides a more accurate and consistent way to evaluate how well AI agents handle practical software engineering tasks.

Key elements that made this possible:

- Extensive guardrails that step in when the model gets stuck or goes off track
- A debug_script() sub-agent that uses pdb to fix bugs and can modify existing scripts or create new ones
- A strategic_planning() tool powered by o3 to rethink and refine fixes when needed

The full pipeline we used for SWE-bench Verified is open-source. You can implement the same components and run the benchmark just like we did to reproduce the Refact.ai Agent approach and score end-to-end.

Read on to see how the Agent is built for SWE-bench, and how the same ideas power real-world workflows in Refact.ai.

Model setup

- Orchestration model: Claude-3.7
- Debug sub-agent, debug_script(): Claude-3.7 + o4-mini
- Planning tool, strategic_planning(): o3
- pass@1: each task is attempted only once.
- Temperature: 0 for every Claude model.

For each SWE-bench Verified problem, Refact.ai Agent made one multi-step run aiming to produce a single, correct final solution. Our main goal was to achieve the maximum score in a single attempt.

Simpler, more effective Agent prompt

We revised the Agent prompt from our SWE-bench Lite run, where we ranked at the top with a 59.7% score. Back then, the prompt was more complex. After observing how the Agent behaved, we realized that simpler is better. The new version is shorter and easier to follow.
Since Refact.ai is open-source, you can explore it: You are a fully autonomous agen...
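
To make the model setup and the headline number concrete, here is a minimal Python sketch. It only restates the settings listed under "Model setup" and recomputes the pass@1 score from the 349 out of 500 solved tasks. The names `RunConfig` and `pass_at_1` are hypothetical and chosen for illustration; they are not taken from the Refact.ai codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the settings listed under "Model setup"."""
    orchestration_model: str = "Claude-3.7"
    debug_models: tuple = ("Claude-3.7", "o4-mini")  # used by debug_script()
    planning_model: str = "o3"                       # used by strategic_planning()
    attempts_per_task: int = 1                       # pass@1: one attempt per task
    claude_temperature: float = 0.0                  # temperature 0 for Claude models


def pass_at_1(solved: int, total: int) -> float:
    """Percentage of tasks solved when each task gets exactly one attempt."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return 100.0 * solved / total


config = RunConfig()
score = pass_at_1(solved=349, total=500)
print(f"attempts per task: {config.attempts_per_task}, score: {score:.1f}%")
```

With a single attempt per task, the score is simply the fraction of tasks solved, so 349 of 500 gives the reported 69.8%.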