Elimination Game Benchmark: Social Reasoning, Strategy, and Deception in Multi-Agent LLM Dynamics

The Elimination Game is a multi-player tournament that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other round by round until only two remain. A jury of eliminated players then casts deciding votes to crown the winner.

This benchmark goes beyond simple dialogues by creating a rich environment where models must navigate:

- Public vs. Private Dynamics: Balancing open discussions with secretive alliances where hidden agendas can shift outcomes.
- Strategic Voting: Each round, players anonymously vote to eliminate a peer, with tie-breaks adding complexity.
- Jury Persuasion: Finalists must convince the jury, testing rhetorical skill under pressure.

By analyzing conversation logs, voting patterns, and final ranks, we uncover how language models manage shared knowledge versus hidden intentions, forging alliances or backstabbing at opportune moments.

Animation

[Video: vs_detail1_output.mp4]

We provide a round-by-round replay visualizing:

- Public Chat: Each active seat's statement in the single public subround.
- Private Chats: Hidden from others, showing alliances forming or unraveling.
- Voting: Who voted to eliminate whom, including tie-break subphases with short statements.
- Jury Decision: The final two's pleas and the jury's decisive votes.

Longer video:

Visualizations & Metrics

TrueSkill Leaderboard (μ ± σ)

A horizontal bar chart for each model's skill rating, sorted top to bottom by μ.
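As a rough illustration of the leaderboard layout described above, the sketch below sorts a set of ratings by μ (highest first) and prints each one as a text bar followed by "μ ± σ". It is not the benchmark's plotting code. The `Rating` class, the `leaderboard` and `format_row` helpers, the model names, and all numbers are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """One model's skill estimate: mean (mu) and uncertainty (sigma)."""
    model: str
    mu: float
    sigma: float


def leaderboard(ratings):
    """Return the ratings sorted by mu, highest first."""
    return sorted(ratings, key=lambda r: r.mu, reverse=True)


def format_row(r, width=20):
    """Render one row: padded model name, a bar of '#' proportional to mu, then mu ± sigma."""
    bar = "#" * max(0, round(r.mu))
    return f"{r.model:<{width}} {bar} {r.mu:.2f} ± {r.sigma:.2f}"


# Hypothetical placeholder ratings for illustration only (not real results).
ratings = [
    Rating("model-a", 25.0, 8.33),
    Rating("model-b", 27.5, 2.10),
    Rating("model-c", 22.1, 3.40),
]

for row in leaderboard(ratings):
    print(format_row(row))
```

Sorting on μ alone matches the chart description. A more conservative ordering, such as ranking by μ minus a multiple of σ, is a common alternative for TrueSkill-style ratings, but nothing here indicates the benchmark uses it.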
Ref...