The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
- URL: http://arxiv.org/abs/2602.17831v1
- Date: Thu, 19 Feb 2026 20:49:15 GMT
- Title: The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
- Authors: Simon Henniger, Gabriel Poesia
- Abstract summary: We take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating puzzles. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.
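The Programming Puzzles format the abstract describes can be sketched as follows. This is a hypothetical illustrative puzzle, not one from the paper: a puzzle is a Python function returning a boolean, and a solution is any input that makes it return `True`, which makes verification mechanical.

```python
# A minimal sketch of the Programming Puzzles format: a puzzle is a
# Python function that returns a bool; any input making it return True
# is a valid solution. The example puzzle below is hypothetical.

def puzzle(s: str) -> bool:
    """Find a string of length 10 that contains exactly three 'a's."""
    return len(s) == 10 and s.count("a") == 3

def verify(puzzle_fn, candidate) -> bool:
    """A candidate solves the puzzle iff the function returns True on it."""
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        # Malformed candidates (wrong type, raised errors) are rejected.
        return False

print(verify(puzzle, "aaabbbbbbb"))  # a valid solution
print(verify(puzzle, "abc"))        # too short, rejected
```

Because the checker is the puzzle itself, a duel only needs to run the opponent's candidate input through the function, with no human grading.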
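The Elo ratings computed from pairwise duels can be sketched with the standard Elo update rule; the paper's exact rating procedure and K-factor are not specified here, so the constants below are illustrative assumptions.

```python
# A minimal sketch of Elo updates from pairwise duel outcomes (standard
# Elo formula; the paper's exact rating procedure may differ).

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    # score_a: 1.0 if model A won the duel, 0.5 for a draw, 0.0 for a loss.
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models; A wins the duel and gains rating from B.
ra, rb = elo_update(1000.0, 1000.0, 1.0)
print(ra, rb)
```

Iterating this update over all pairwise duel results yields the relative ranking the paper compares against benchmarks like Humanity's Last Exam.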
Related papers
- Evaluating Language Models' Evaluations of Games [65.49017696754825]
We advocate for a new paradigm that assesses AI systems' evaluation of games. We leverage a large-scale dataset of over 100 novel board games and over 450 human judgments. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models.
arXiv Detail & Related papers (2025-10-13T02:45:37Z) - UQ: Assessing Language Models on Unsolved Questions [149.46593270027697]
We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers.
arXiv Detail & Related papers (2025-08-25T01:07:59Z) - Self-Questioning Language Models [58.73276539661649]
We propose an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver. Both the proposer and solver are trained via reinforcement learning. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces.
arXiv Detail & Related papers (2025-08-05T17:51:33Z) - Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.497499123166804]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z) - PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts [47.92619068073141]
We introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning.
arXiv Detail & Related papers (2025-06-06T16:17:09Z) - EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [17.056693711040747]
We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events. This dataset probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. The benchmark comprises 1184 puzzles of varying complexity requiring teams of skilled solvers hours to days to complete.
arXiv Detail & Related papers (2025-02-13T00:18:34Z) - BRAINTEASER: Lateral Thinking Puzzles for Large Language Models [15.95314613982879]
BRAINTEASER is a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking.
Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance.
We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
arXiv Detail & Related papers (2023-10-08T07:46:01Z) - Solving and Generating NPR Sunday Puzzles with Large Language Models [0.0]
State-of-the-art large language models can solve many PUZZLEQA puzzles.
The best model, GPT-3.5, achieves 50.2% loose accuracy.
GPT-3.5 generates puzzles with answers that do not conform to the generated rules.
arXiv Detail & Related papers (2023-06-21T13:23:48Z) - Are Deep Neural Networks SMARTer than Second Graders? [85.60342335636341]
We evaluate the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed for children in the 6--8 age group.
Our dataset consists of 101 unique puzzles; each puzzle comprises a picture question, and its solution requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning.
Experiments reveal that while powerful deep models offer reasonable performances on puzzles in a supervised setting, they are not better than random accuracy when analyzed for generalization.
arXiv Detail & Related papers (2022-12-20T04:33:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.