OckBench: Measuring the Efficiency of LLM Reasoning
- URL: http://arxiv.org/abs/2511.05722v1
- Date: Fri, 07 Nov 2025 21:29:41 GMT
- Title: OckBench: Measuring the Efficiency of LLM Reasoning
- Authors: Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu
- Abstract summary: We introduce OckBench, a benchmark that evaluates both accuracy and token count for reasoning and coding tasks. We show that many models with comparable accuracy differ wildly in token consumption. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning.
- Score: 19.06128472840761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ wildly in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as "free" to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at https://ockbench.github.io/ .
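The accuracy-efficiency Pareto frontier the abstract argues for is straightforward to compute once each model is reduced to an (accuracy, token count) pair. The sketch below is a minimal illustration; the model names and measurements are hypothetical, not OckBench results:

```python
def pareto_frontier(models):
    """Return the models not dominated on the accuracy-vs-token-count plane.

    A model is dominated if some other model has accuracy >= its accuracy
    and token count <= its token count, with at least one strict inequality.
    """
    frontier = []
    for name, acc, toks in models:
        dominated = any(
            a2 >= acc and t2 <= toks and (a2 > acc or t2 < toks)
            for _, a2, t2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (accuracy, mean output tokens) measurements:
models = [
    ("model-a", 0.82, 9_500),
    ("model-b", 0.82, 41_000),  # same accuracy as model-a, ~4x the tokens
    ("model-c", 0.90, 55_000),
    ("model-d", 0.75, 3_000),
]
print(pareto_frontier(models))  # ['model-a', 'model-c', 'model-d']
```

Note that model-b drops off the frontier: equal accuracy at 4x the token budget is exactly the kind of difference that accuracy-only benchmarks hide.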
Related papers
- Decomposing Reasoning Efficiency in Large Language Models [2.4149105714758545]
We decompose token efficiency into interpretable factors: completion under a fixed token budget, conditional correctness given completion, and verbosity. When reasoning traces are available, we add deterministic trace-quality measures to separate looping from verbose-but-engaged reasoning. Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
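A rough illustration of such a three-factor decomposition (the run records below are made up, and the exact metric definitions are this sketch's assumption, not necessarily the paper's):

```python
# Hypothetical per-problem run records:
# (finished_within_budget, answer_correct, tokens_used)
runs = [
    (True,  True,   800),
    (True,  False, 1200),
    (True,  True,   600),
    (False, False, 4096),  # hit the token budget without finishing
]

completed = [r for r in runs if r[0]]

# Completion under a fixed token budget:
completion_rate = len(completed) / len(runs)
# Conditional correctness, given that the run completed:
cond_correct = sum(r[1] for r in completed) / len(completed)
# Verbosity: mean tokens spent on completed runs:
verbosity = sum(r[2] for r in completed) / len(completed)

print(completion_rate, cond_correct, verbosity)
```

Two models with the same overall accuracy can then differ in which factor is the bottleneck, which is what motivates different interventions.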
arXiv Detail & Related papers (2026-02-10T14:09:18Z)
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z)
- TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z)
- Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks [34.09939383415074]
Benchmark Profiling decomposes benchmark performance into ten cognitively grounded abilities. It explains why performance gains do not always translate into user-perceived competence.
arXiv Detail & Related papers (2025-09-23T15:32:47Z)
- Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference [31.2331188304598]
Changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32.
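The non-associativity at the heart of this variability is easy to reproduce in plain Python (float64 here; the effect is larger still in the FP16/BF16 reductions used on GPUs, where batch size and device count change the summation order):

```python
# Floating-point addition is not associative: changing the order in which
# partial sums are reduced can change the result at the last-bit level.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False
```

Since parallel reductions pick a grouping based on hardware configuration, bitwise-identical logits across configurations cannot be taken for granted.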
arXiv Detail & Related papers (2025-06-11T08:23:53Z)
- AutoJudge: Judge Decoding Without Manual Annotation [13.451750613294054]
AutoJudge is a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected.
arXiv Detail & Related papers (2025-04-28T17:59:28Z)
- THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models [65.39456695678713]
We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists. We find that in general, reasoning models are poorly calibrated, particularly on easy problems. We introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
arXiv Detail & Related papers (2025-04-17T22:16:30Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.931194824519935]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
- Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z)
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.