Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?
- URL: http://arxiv.org/abs/2504.01935v1
- Date: Wed, 02 Apr 2025 17:45:58 GMT
- Title: Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?
- Authors: Celine Lee, Alexander M. Rush, Keyon Vafa,
- Abstract summary: We formalize a framework using deterministic finite automata (DFAs)<n>We show that there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized.<n>We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.
- Score: 72.70486097967124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) often benefit from verbalized reasoning at inference time, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we formalize a framework using deterministic finite automata (DFAs). DFAs offer a formalism through which we can characterize task complexity through measurable properties such as run length (number of reasoning steps required) and state-space size (decision complexity). We first show that across different tasks and models of different sizes and training paradigms, there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized. We then investigate which properties of complexity govern this critical length: we find that task instances with longer corresponding underlying DFA runs (i.e. demand greater latent state-tracking requirements) correlate with longer reasoning lengths, but, surprisingly, that DFA size (i.e. state-space complexity) does not. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.
Related papers
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones.
This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately.
Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z) - DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [31.189242663680695]
This paper introduces Difficulty-Adaptive Slow-Thinking (DAST), a novel framework that enables models to autonomously adjust the length of Chain-of-Thought(CoT) based on problem difficulty.<n>Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking while preserving reasoning accuracy on complex problems.
arXiv Detail & Related papers (2025-03-06T14:23:06Z) - How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [4.055489363682199]
We conduct the first systematic study of the relationship between reasoning length and model performance.<n>We show that this tradeoff persists across even very distinct reasoning chains.<n>We show that prompt-based compression strategies operate far from theoretical limits.
arXiv Detail & Related papers (2025-03-03T03:48:20Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs)<n>However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?<n>In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - FLARE: Faithful Logic-Aided Reasoning and Exploration [50.9814063216852]
We introduce a novel approach for traversing the problem space using task decompositions.<n>We use the Large Language Models to plan a solution, soft-formalise the query into facts and predicates using a logic programming code.<n>Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z) - DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs)
We first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
In particular, we first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z) - The Impact of Reasoning Step Length on Large Language Models [40.546685248243534]
Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models.
We investigate the correlation between the effectiveness of CoT and the length of reasoning steps in prompts.
arXiv Detail & Related papers (2024-01-10T04:37:38Z) - Complexity-Based Prompting for Multi-Step Reasoning [72.0057198610614]
We study the task of prompting large-scale language models to perform multi-step reasoning.
A central question is which reasoning examples make the most effective prompts.
We propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning.
arXiv Detail & Related papers (2022-10-03T05:33:27Z) - From Checking to Inference: Actual Causality Computations as
Optimization Problems [79.87179017975235]
We present a novel approach to formulate different notions of causal reasoning, over binary acyclic models, as optimization problems.
We show that both notions are efficiently automated. Using models with more than $8000$ variables, checking is computed in a matter of seconds, with MaxSAT outperforming ILP in many cases.
arXiv Detail & Related papers (2020-06-05T10:56:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.