Related papers: The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

URL: http://arxiv.org/abs/2502.15631v1
Date: Fri, 21 Feb 2025 17:59:13 GMT
Title: The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
Authors: Marthe Ballon, Andres Algaba, Vincent Ginis
Abstract summary: We analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark. We find that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. The accuracy drop that accompanies longer reasoning chains is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively.
Score: 1.474723404975345
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.
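The abstract's comparison of accuracy against reasoning length "when controlling for difficulty" can be illustrated with a small stratified tabulation. The Python sketch below is not the authors' analysis code: the function name `accuracy_by_length_bin`, the record layout, the single 1000-token bin edge, and the toy data are all hypothetical, chosen only to show the idea of computing accuracy per token-length bin within each difficulty tier.

```python
from collections import defaultdict


def accuracy_by_length_bin(records, bin_edges):
    """Return accuracy per (difficulty, length-bin) group.

    records:   iterable of (difficulty, reasoning_tokens, is_correct) tuples
               (hypothetical layout, not taken from the paper).
    bin_edges: ascending token thresholds; a record falls into bin k when its
               token count exceeds exactly k of the thresholds.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for difficulty, tokens, is_correct in records:
        bin_index = sum(tokens > edge for edge in bin_edges)
        cell = counts[(difficulty, bin_index)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Hypothetical toy data, for illustration only (not results from the paper).
toy_records = [
    ("easy", 500, True),
    ("easy", 700, True),
    ("easy", 3000, True),
    ("easy", 3500, False),
    ("hard", 900, True),
    ("hard", 800, False),
    ("hard", 5000, False),
    ("hard", 6000, False),
]

result = accuracy_by_length_bin(toy_records, bin_edges=[1000])
for key in sorted(result):
    print(key, result[key])
```

Comparing bins within a single difficulty tier, rather than across the whole dataset, is what keeps a harder question mix among long chains from being mistaken for an effect of length itself.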

Related papers

Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z)
ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning [1.170732359523702]
Reasoning models such as OpenAI o3 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks. Long reasoning traces can facilitate a more thorough exploration of solution paths for complex problems. We introduce ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning language models to discover their own optimal CoT lengths.
arXiv Detail & Related papers (2025-04-30T07:04:19Z)
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models [72.75501495786297]
We introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture. Experimental results show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models.
arXiv Detail & Related papers (2025-04-14T17:38:25Z)
ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models [16.407923457296235]
This work investigates how reasoning length is embedded in the hidden representations of reasoning models. We introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning.
arXiv Detail & Related papers (2025-03-27T23:53:45Z)
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data. We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z)
Small Models Struggle to Learn from Strong Reasoners [14.895026967556088]
Small models do not consistently benefit from long chain-of-thought reasoning or distillation from larger models. We propose Mix Distillation, a strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone.
arXiv Detail & Related papers (2025-02-17T18:56:15Z)
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z)
When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs). However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles [29.214813685163218]
OpenAI's releases of o1 and o3 mark a paradigm shift in Large Language Models towards advanced reasoning capabilities. We track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles. The superior performance of o1 comes at nearly 750 times the computational cost of GPT-4o, raising concerns about its efficiency.
arXiv Detail & Related papers (2025-02-03T05:47:04Z)
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [98.3430004984531]
We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
arXiv Detail & Related papers (2025-01-22T01:35:11Z)
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.