Given the proliferation of reasoning models, we wanted to go beyond knowledge-based benchmarks to test reasoning abilities such as pattern recognition, lateral thinking, abstraction, contextual reasoning (accounting for British cultural references), and multi-step inference.

In addition to reasoning, we aimed to assess how effectively models make decisions when presented with judgment calls, such as choosing between making an educated guess based on available clues or calling a function to retrieve additional information. This capability is crucial for building multi-agent orchestration systems.

Another objective was to measure improvements in the latest GPT-5 models, particularly using the reasoning effort and verbosity parameters, comparing them to previous iterations and evaluating their token and reasoning time efficiency.

What is Only Connect?

Only Connect tests contestants' ability to identify connections between seemingly unrelated clues. It prioritizes lateral thinking, pattern recognition, and creative problem-solving over quick recall. The game consists of four rounds:

Connections: Players identify the common thread linking 1-4 clues
Sequences: Players predict the fourth element in a sequence after seeing 1-3 clues
Wall: Players group 16 elements into four categories (similar to the NYT Connections game)
Missing Vowels: Players reconstruct phrases with removed vowels and spaces using cryptic clues

Given its emphasis on clever reasoning rather than knowledge recall, Only Connect provides an ideal challenge for benchmarking LLMs' reasoning capabilities. We also wanted to track performance improvements across successive model generations.

Methodology

We selected models for analysis including GPT-3, GPT-4-Mini, GPT-4.1, Claude Sonnet 4, Opus 4, and Opus 4.1, along with GPT-5 using eight different parameter configurations (low/high verbosity and minimal/low/medium/high reasoning).

Questions were sourced from the Only Connect game show, following official rules.
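The eight GPT-5 configurations are the cross product of the two verbosity levels and the four reasoning levels named above. A minimal sketch of how such a grid could be enumerated follows; the dictionary keys and the "gpt-5" label are illustrative assumptions, not the harness actually used for the benchmark.

```python
from itertools import product

# Levels as named in the methodology.
VERBOSITY_LEVELS = ["low", "high"]
REASONING_LEVELS = ["minimal", "low", "medium", "high"]


def build_gpt5_configs():
    """Return one dict per (verbosity, reasoning effort) combination.

    The key names and the "gpt-5" label are hypothetical placeholders.
    """
    return [
        {"model": "gpt-5", "verbosity": v, "reasoning_effort": r}
        for v, r in product(VERBOSITY_LEVELS, REASONING_LEVELS)
    ]


configs = build_gpt5_configs()
for cfg in configs:
    print(cfg)
```

Enumerating the grid up front keeps every question paired with the same set of settings, so token counts and reasoning time can be compared across configurations.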
For rou...