Related papers: BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models

BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models

URL: http://arxiv.org/abs/2509.24210v1
Date: Mon, 29 Sep 2025 02:49:01 GMT
Title: BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models
Authors: Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang,
Abstract summary: We introduce BeyondBench, an evaluation framework that avoids contamination from internet-scale training data.<n>Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels.<n>We evaluate 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters.
Score: 13.380359214677176
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a decline of 16.81%, 28.05%, and 47.59% accuracy on the hard suite. Our leaderboard is publicly available at https://ctrl-gaurav.github.io/BeyondBench/

Related papers

ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces [3.151184728006369]
We present ACAR, a measurement framework for studying multi-model orchestration under auditable conditions.<n>ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes.<n>We evaluate ACAR on 1,510 tasks spanning four benchmarks, producing more than 7,550 auditable runs.
arXiv Detail & Related papers (2026-02-06T23:27:17Z)
Scalable Generation and Validation of Isomorphic Physics Problems with GenAI [2.249733437447874]
We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI.<n>Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations.<n>For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) and compare against actual student performance.
arXiv Detail & Related papers (2026-02-04T23:01:20Z)
From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study gap through contextual mathematical reasoning.<n>We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings.<n>Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
†DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems [1.2310602580215997]
Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, but its behavior under irrelevant context remains underexplored.<n>DisTRACTMATH-BN is a benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information.<n>DAGGER reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes.
arXiv Detail & Related papers (2026-01-11T10:51:03Z)
mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models [2.0467354053171243]
We introduce textbfmmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's Chemistry Advanced examination ( 2019-2025)<n>Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters.
arXiv Detail & Related papers (2025-11-12T13:52:37Z)
QueST: Incentivizing LLMs to Generate Difficult Problems [77.75835742350644]
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems.<n>Existing competitive coding datasets contain only thousands to tens of thousands of problems.<n>We propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning.
arXiv Detail & Related papers (2025-10-20T16:29:53Z)
WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning [51.13280433665446]
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics.<n>In wireless communications, where problems require precise manipulation of information-theoretic bounds, even state-of-the-art models struggle to achieve competent performance.<n>We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning.
arXiv Detail & Related papers (2025-09-27T09:58:03Z)
ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning [51.946959481392064]
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving.<n>We propose ScaleDiff, a pipeline designed to scale the creation of difficult problems.<n>We show that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models.
arXiv Detail & Related papers (2025-09-25T12:22:44Z)
Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs [19.592385109516268]
Current benchmarks for large language models (LLMs) are approaching saturation and are increasingly compromised by trainingset contamination.<n>We introduce Putnam-AXIOM, a benchmark drawn from the prestigious William Lowell Putnam Mathematical Competition.<n>The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding contamination-resilient test bed.
arXiv Detail & Related papers (2025-08-05T17:57:50Z)
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models [65.39456695678713]
We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists.<n>We find that in general, reasoning models are poorly calibrated, particularly on easy problems.<n>We introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
arXiv Detail & Related papers (2025-04-17T22:16:30Z)
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.<n>We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving [0.0]
This study evalu-ates 10 large language models (LLMs) with 7 to 8 billion parameters using the MATH dataset.<n>The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions.
arXiv Detail & Related papers (2025-01-28T17:11:36Z)
HARP: A challenging human-annotated math reasoning benchmark [7.691786865279827]
We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO).<n>Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy).<n>These problems range six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro).<n>Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written
arXiv Detail & Related papers (2024-12-11T23:31:06Z)
Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [24.386388107656334]
We introduce Prove, a framework that leverages translated programs derived from natural language solutions as a verification mechanism.<n>Unlike vanilla majority voting, our approach filters out solutions whose corresponding program output is inconsistent with the generated solution, aggregating only those that pass verification.<n>Our results show that Prove consistently outperforms vanilla majority voting for solving mathematical reasoning tasks across all model sizes and datasets.
arXiv Detail & Related papers (2024-10-16T14:24:55Z)
Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems [62.76627483915117]
Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks.<n>We introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles.<n>We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.