The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- URL: http://arxiv.org/abs/2509.09677v2
- Date: Sun, 28 Sep 2025 13:00:13 GMT
- Title: The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Authors: Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping
- Abstract summary: We show that even marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. We argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illusion of slowing progress, as even marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. We then argue that failures of LLMs on simple tasks that are made longer arise from mistakes in execution, rather than an inability to reason. So, we propose isolating execution capability by explicitly providing the knowledge and plan needed to solve a long-horizon task. First, we find that larger models can correctly execute significantly more turns even when small models have near-perfect single-turn accuracy. We then observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, where models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not reduced by simply scaling model size. However, we find that thinking mitigates self-conditioning, and also enables execution of much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of tasks they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and to highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
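The abstract's compounding argument has a one-line form: if each step succeeds independently with per-step accuracy p, an n-step task succeeds with probability p^n, so the task length achievable at a fixed success rate grows like 1/(1 - p). A minimal sketch of that arithmetic, assuming independent per-step success (the 50% threshold and the accuracy values are illustrative, not the paper's measurements):

```python
# Toy model: a task of n steps succeeds iff all n steps succeed independently.
import math

def horizon_length(p: float, success_threshold: float = 0.5) -> float:
    """Longest task length n with p**n >= success_threshold."""
    return math.log(success_threshold) / math.log(p)

for p in (0.99, 0.999):
    print(f"step accuracy {p:.3f} -> ~{horizon_length(p):.0f}-step horizon at 50% success")
```

Under this toy model, moving per-step accuracy from 99% to 99.9% stretches the 50%-completion horizon from about 69 steps to about 693, which is the sense in which marginal single-step gains compound.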
Related papers
- LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations (arXiv, 2026-02-10)
  We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks. We show that models encode a model-specific notion of difficulty that is distinct from human difficulty. We demonstrate that routing queries across a pool of models can exceed the best-performing model.
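As a rough illustration of the probing recipe this summary describes, one can fit a logistic probe on per-example activation vectors against success labels. Everything below (shapes, synthetic data, labels) is a placeholder, not the paper's setup:

```python
# Sketch: linear probe on pre-generation activations predicting task success.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))       # placeholder hidden states
succeeded = (activations[:, 0] > 0).astype(int)  # placeholder success labels

X_tr, X_te, y_tr, y_te = train_test_split(activations, succeeded, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```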
- The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity? (arXiv, 2026-01-30)
  As AI becomes more capable, we entrust it with more general and consequential tasks. We operationalize this question using a bias-variance decomposition of the errors made by AI models. As more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict that failures will be accompanied by more incoherent behavior.
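The bias-variance decomposition mentioned above has a compact numeric form: over repeated attempts at the same task, mean squared error splits exactly into squared bias (systematic error) plus variance (incoherence across attempts). A toy sketch with illustrative numbers, not the paper's data:

```python
# Toy bias-variance decomposition over repeated model attempts at one task.
import numpy as np

rng = np.random.default_rng(0)
target = 10.0                                         # ground-truth answer
attempts = target + 2.0 + rng.normal(0.0, 3.0, size=50)  # offset + noise

bias_sq = (attempts.mean() - target) ** 2   # squared bias: systematic error
variance = attempts.var()                    # variance: incoherence across attempts
mse = ((attempts - target) ** 2).mean()
assert np.isclose(mse, bias_sq + variance)   # exact decomposition identity
print(f"bias^2={bias_sq:.2f}  variance={variance:.2f}  mse={mse:.2f}")
```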
- Frontier LLMs Still Struggle with Simple Reasoning Tasks (arXiv, 2025-07-09)
  This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems, and for similar reasons.
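A hedged sketch of what a procedurally generated "simple" reasoning task can look like, using a counting template (the paper's actual templates may differ):

```python
# Procedural generator for one counting task with a known ground truth.
import random

def make_counting_task(n_items: int = 20, seed: int = 0) -> tuple[str, int]:
    """Generate one counting prompt and its ground-truth answer."""
    rng = random.Random(seed)
    words = [rng.choice(["apple", "pear", "plum"]) for _ in range(n_items)]
    prompt = f"How many times does 'apple' appear in this list? {' '.join(words)}"
    return prompt, words.count("apple")

prompt, answer = make_counting_task()
print(prompt)
print("answer:", answer)
```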
- Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs (arXiv, 2025-06-08)
  A key factor influencing answer quality is the length of the thinking stage. This paper explores and exploits the mechanisms by which LLMs understand and regulate the length of their reasoning. Our results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency.
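The paper works at the level of internal representations; as a much cruder stand-in, one can cap the thinking stage at a token budget and then force the answer stage. The `llm` helper and the `<think>` markers below are assumptions for illustration, not the paper's method:

```python
# Sketch: cap the thinking stage at a fixed token budget.
def generate_with_thinking_budget(llm, prompt: str, budget: int) -> str:
    """`llm(text, stop=..., max_tokens=...)` is a hypothetical completion helper."""
    thinking = llm(prompt + "<think>", stop=["</think>"], max_tokens=budget)
    # Close the thinking block ourselves if the budget cut it off, then answer.
    return llm(prompt + "<think>" + thinking + "</think>", max_tokens=256)
```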
- Incentivizing LLMs to Self-Verify Their Answers (arXiv, 2025-06-02)
  Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks. We propose a framework that incentivizes LLMs to self-verify their own answers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B.
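The summary describes a training framework, but the generate-then-verify loop itself is easy to sketch at the prompting level. `llm` is a hypothetical completion helper, and the paper incentivizes this behavior with training rather than prompt engineering:

```python
# Sketch: generate an answer, self-verify it, retry on failure.
def answer_with_self_verification(llm, question: str, max_tries: int = 3) -> str:
    answer = ""
    for _ in range(max_tries):
        answer = llm(f"Question: {question}\nSolve step by step, then state the answer.")
        verdict = llm(
            f"Question: {question}\nProposed solution: {answer}\n"
            "Carefully check the solution. Reply exactly VALID or INVALID."
        )
        if verdict.strip() == "VALID":
            return answer
    return answer  # fall back to the last attempt
```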
- Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead (arXiv, 2025-03-31)
  Inference-time scaling can enhance the reasoning capabilities of large language models. We investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks.
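One representative inference-time scaling method is self-consistency: sample several answers at nonzero temperature and majority-vote. A sketch with a hypothetical `llm` sampler (not necessarily one of the methods this particular survey evaluates):

```python
# Sketch: self-consistency via majority voting over sampled answers.
from collections import Counter

def majority_vote_answer(llm, question: str, n_samples: int = 8) -> str:
    """`llm` is a hypothetical sampler returning a final-answer string."""
    answers = [llm(question, temperature=0.8) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```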
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning (arXiv, 2025-02-25)
  Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements on complex reasoning tasks. We explore whether scaling with longer CoTs can in fact impair the reasoning performance of Large Language Models (LLMs) in certain domains.
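If longer CoTs can hurt, a natural selection rule is to prefer the shortest sampled chain of thought that still reaches the correct answer. A sketch of that selection step (an assumption in this spirit, not necessarily the paper's exact procedure):

```python
# Sketch: among sampled responses, keep the shortest correct chain of thought.
def shortest_correct_cot(samples: list[tuple[str, str]], gold: str) -> str | None:
    """`samples` holds (chain_of_thought, final_answer) pairs."""
    correct = [cot for cot, ans in samples if ans == gold]
    return min(correct, key=len) if correct else None
```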
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning (arXiv, 2025-01-22)
  We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
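In the spirit of length-harmonizing fine-tuning, one can imagine a reward that pays for being shorter than a reference length while gating on correctness. The functional form and coefficient below are assumptions for illustration, not the paper's objective:

```python
# Hypothetical length-harmonizing reward: shorter-than-reference is rewarded,
# but only correctness-weighted (shape and `lam` are illustrative assumptions).
def length_harmonizing_reward(pred_len: int, ref_len: float,
                              correct: bool, lam: float = 1.0) -> float:
    length_gain = ref_len / max(pred_len, 1) - 1.0  # positive when shorter than reference
    accuracy_term = 1.0 if correct else -1.0
    return length_gain + lam * accuracy_term
```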
- First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning (arXiv, 2023-11-14)
  Language models can solve complex reasoning tasks better by learning to generate rationales for their predictions. We observe that smaller models in particular, when corrected, can solve tasks they would otherwise have struggled with. We propose QuestCoT, where a smaller model first asks itself how to start before proceeding with a chain of reasoning.
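The QuestCoT recipe, as summarized, decomposes into two calls: first ask the model how to start, then condition the full solution on that first step. A prompting sketch with a hypothetical `llm` completion helper:

```python
# Sketch: QuestCoT-style two-stage prompting.
def quest_cot(llm, problem: str) -> str:
    first_step = llm(f"{problem}\nBefore solving, state the right first step.")
    return llm(f"{problem}\nFirst step: {first_step}\nNow continue the reasoning and solve.")
```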