Related papers: Decomposing Reasoning Efficiency in Large Language Models

Decomposing Reasoning Efficiency in Large Language Models

URL: http://arxiv.org/abs/2602.09805v1
Date: Tue, 10 Feb 2026 14:09:18 GMT
Title: Decomposing Reasoning Efficiency in Large Language Models
Authors: Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud,
Abstract summary: We decompose token efficiency into interpretable factors: completion under a fixed token budget, conditional correctness given completion, and verbosity.<n>When reasoning traces are available, we add deterministic trace-quality measures to separate looping from verbose-but-engaged reasoning.<n>Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
Score: 2.4149105714758545
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $ρ=0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.

Related papers

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable.<n>We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z)
Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel.<n>Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding.<n>We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training [11.179110411255708]
We propose a direct framework to model the scaling of benchmark performance from the training budget.<n>Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure.<n>We release the complete set of pretraining losses and downstream evaluation results.
arXiv Detail & Related papers (2025-12-09T18:33:48Z)
What Does It Take to Build a Performant Selective Classifier? [30.90225954725644]
Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack are studied.<n>We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks.<n>Our results confirm that (i) Bayes noise and limited model capacity can account for substantial gaps, (ii) only richer, feature-aware calibrators meaningfully improve score ordering, and (iii) data shift introduces a separate slack that demands distributionally robust training.
arXiv Detail & Related papers (2025-10-23T05:48:40Z)
Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B [1.948261185683419]
We investigate whether "evaluation scent" inflates measured performance without commensurate capability gains.<n>We run six paired A/B scenarios that hold task content and decoding fixed while varying framing.<n>We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts) and practical guidance.
arXiv Detail & Related papers (2025-10-08T09:49:05Z)
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization.<n>Character-level segmentation improves string manipulation and code understanding tasks by up to +14%.<n>Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.<n>Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.<n>We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z)
Latent Veracity Inference for Identifying Errors in Stepwise Reasoning [78.29317733206643]
We introduce Veracity Search (VS), a discrete search algorithm over veracity assignments.<n>It performs otherwise intractable inference in the posterior distribution over latent veracity values.<n>It generalizes VS, enabling accurate zero-shot veracity inference in novel contexts.
arXiv Detail & Related papers (2025-05-17T04:16:36Z)
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [65.62084602011596]
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks.<n>We have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuations) contribute disproportionately to attention scores compared to semantically meaningful tokens.<n>We introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
arXiv Detail & Related papers (2024-12-16T18:58:57Z)
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs)<n>This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.<n>We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query. Existing methods such as similarity search and crossencoder models exhibit significant limitations. We introduce SparseCL that leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.