OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
- URL: http://arxiv.org/abs/2508.13141v2
- Date: Sat, 04 Oct 2025 14:25:22 GMT
- Title: OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
- Authors: Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha
- Abstract summary: Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems. Non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs.
- Score: 61.90251858867122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
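The abstract mentions "thinking-adjusted accuracy metrics" without defining them here. Below is a minimal illustrative sketch, assuming a metric that discounts the credit for a correct answer by the number of thinking tokens spent on it; the function name and the discount rate `alpha` are hypothetical, not the paper's actual formulation.

```python
# Hypothetical thinking-adjusted accuracy: plain accuracy where each correct
# answer's credit is eroded by the thinking tokens spent to reach it.
# `alpha` (discount per token) is an illustrative assumption, not a value
# taken from the paper.

def thinking_adjusted_accuracy(results, alpha=0.001):
    """results: list of (is_correct, thinking_tokens) pairs, one per query."""
    if not results:
        return 0.0
    score = 0.0
    for is_correct, tokens in results:
        if is_correct:
            # A correct answer with zero thinking scores 1.0; credit decays
            # linearly with thinking tokens, floored at 0.
            score += max(0.0, 1.0 - alpha * tokens)
    return score / len(results)

# Correct with no thinking -> 1.0 credit; correct after 500 thinking
# tokens -> 0.5 credit; incorrect -> 0. Mean over three queries: 0.5.
print(thinking_adjusted_accuracy([(True, 0), (True, 500), (False, 100)]))
```

Under such a metric, a model that overthinks on easy queries scores worse than one reaching the same answers with fewer thinking tokens, which is the trade-off the benchmark is designed to surface.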
Related papers
- The Virtues of Brevity: Avoid Overthinking in Parallel Test-Time Reasoning [0.7874708385247352]
We show that the simple and counterintuitive strategy of selecting the shortest solution is highly effective. We confirm that this approach is competitive with complex methods such as self-consistency.
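The shortest-solution heuristic described above can be sketched in a few lines: sample several candidate solutions in parallel and keep the tersest one. The helper `pick_shortest` and the sample data are illustrative, not the paper's implementation.

```python
# Sketch of the shortest-solution heuristic for parallel test-time
# reasoning: among N sampled solutions, return the answer from the one
# with the fewest tokens (whitespace tokens, for simplicity).

def pick_shortest(candidates):
    """candidates: list of (solution_text, final_answer) pairs."""
    _, answer = min(candidates, key=lambda c: len(c[0].split()))
    return answer

samples = [
    ("Let me reconsider this carefully ... after long deliberation, 42", "42"),
    ("7 * 6 = 42", "42"),
    ("First I will enumerate every possible case ... so the answer is 41", "41"),
]
print(pick_shortest(samples))  # the tersest sample's answer wins
```

This contrasts with self-consistency, which majority-votes over all samples; here length alone acts as the selection signal.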
arXiv Detail & Related papers (2025-10-24T00:47:17Z)
- SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration [49.290631188365786]
Long chain-of-thought (LongCoT) is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. We propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution.
arXiv Detail & Related papers (2025-10-22T16:56:01Z)
- Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking [46.43570276604168]
Long chain-of-thought (CoT) models often engage in unnecessarily extensive reasoning even for simple queries. To bridge this gap, this study introduces TRACE, a systematic, fine-grained analyzer of LLMs' thought processes. We propose a utility-based definition of overthinking, which moves beyond length-based metrics.
arXiv Detail & Related papers (2025-10-09T07:33:25Z)
- Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs [36.84838904299283]
Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking. We propose a superposed deployment strategy with a lightweight, training-free regulation that optimizes inference by switching one model on and off.
arXiv Detail & Related papers (2025-10-08T08:17:57Z)
- Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation [82.62935304152239]
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, i.e., generating unnecessarily lengthy reasoning steps for simpler problems. We introduce a novel metric, Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process.
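The TECA idea, a cumulative average of per-token entropies over the reasoning trace, can be sketched as follows. The stopping rule that the paper builds on top of this signal is not shown, and the function names are hypothetical.

```python
import math

# Sketch of the Token Entropy Cumulative Average (TECA) signal: after each
# generated token, average the entropies of the model's output
# distributions seen so far. High early entropy reflects exploration;
# as reasoning settles on an answer, the cumulative average decays.

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def teca(per_token_distributions):
    """Cumulative average entropy after each generated token."""
    averages, total = [], 0.0
    for i, probs in enumerate(per_token_distributions, start=1):
        total += token_entropy(probs)
        averages.append(total / i)
    return averages

# An exploratory first token, then increasingly confident ones:
# the running average falls monotonically in this example.
dists = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
print(teca(dists))
```

A regulation scheme could, for instance, stop extended thinking once this running average drops below a threshold, though the specific rule used in the paper is not reproduced here.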
arXiv Detail & Related papers (2025-10-02T17:36:50Z)
- Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models [23.642200042199484]
We propose Thinking with Nothinking (JointThinking) as an in-context learning (ICL) paradigm for reasoning large language models (RLLMs). Our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. JointThinking significantly outperforms few-shot chain-of-thought (CoT) prompting and majority voting, with improved answer robustness.
arXiv Detail & Related papers (2025-08-05T12:09:55Z)
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models [103.03315678501546]
Extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: does thinking more at test-time truly lead to better reasoning? We show a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking".
arXiv Detail & Related papers (2025-06-04T17:55:09Z)
- Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process. We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z)
- AdaptThink: Reasoning Models Can Learn When to Think [42.77877234302026]
We propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Our experiments indicate that AdaptThink significantly reduces inference costs while further enhancing performance.
arXiv Detail & Related papers (2025-05-19T17:50:52Z)
- Thinkless: LLM Learns When to Think [57.857534644932194]
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning. On several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% to 90%.
arXiv Detail & Related papers (2025-05-19T17:24:16Z)
- Reasoning Models Can Be Effective Without Thinking [45.411955744222524]
We find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x).
arXiv Detail & Related papers (2025-04-14T04:08:16Z)
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [86.79757571440082]
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks. We identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts. We propose a decoding strategy with a thought-switching penalty (TIP) that discourages premature transitions between thoughts.
arXiv Detail & Related papers (2025-01-30T18:58:18Z)
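The thought-switching penalty described in the last entry can be sketched at the logit level: subtract a fixed penalty from tokens that typically open a new line of thought. The trigger words and penalty value below are illustrative assumptions, not the paper's actual TIP tokens or hyperparameters.

```python
# Sketch of a thought-switching penalty applied during decoding. Tokens
# that commonly signal a switch to a new reasoning thread (the set below
# is a guess for illustration) have their logits reduced by a fixed
# penalty, so the model is nudged to finish its current thought instead
# of abandoning it prematurely.

SWITCH_TOKENS = {"Alternatively", "Wait", "However"}
PENALTY = 3.0  # illustrative value, not a paper hyperparameter

def apply_tip(logits):
    """logits: dict mapping candidate next token -> raw logit."""
    return {
        tok: (logit - PENALTY if tok in SWITCH_TOKENS else logit)
        for tok, logit in logits.items()
    }

step = {"Therefore": 2.1, "Alternatively": 2.5, "the": 1.0}
penalized = apply_tip(step)
# Once penalized, "Alternatively" (2.5 - 3.0 = -0.5) no longer outranks
# "Therefore" (2.1), so decoding continues the current thought.
print(max(penalized, key=penalized.get))
```

In a real decoder this adjustment would run inside the sampling loop over the full vocabulary (e.g., as a custom logits processor) rather than over a toy three-token dictionary.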