Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models
- URL: http://arxiv.org/abs/2508.03363v2
- Date: Wed, 06 Aug 2025 03:38:01 GMT
- Title: Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models
- Authors: Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin,
- Abstract summary: We propose Thinking with Nothinking (JointThinking) as an in-context learning (ICL) paradigm for Reasoning large language models (RLLMs)<n>Our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode.<n>JointThinking significantly outperforms few-shot chain-of-thought robustness (CoT) and majority voting with improved answer.
- Score: 23.642200042199484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that leverages the structured difference between two reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy. Specifically, our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt that incorporates the original question and both candidate answers. Since such disagreement occurs infrequently (e.g., only 6\% in GSM8K), our method performs just one round of reasoning in most cases, resulting in minimal latency overhead. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT) and majority voting with improved answer robustness. Moreover, It achieves comparable in-distribution performance to training-based SOTA method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing that leveraging different reasoning modes consistently lowers the error rate and highlights the value of structural thinking diversity. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second round of thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.
Related papers
- Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction.<n>We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.<n>We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models [103.03315678501546]
Extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance.<n>This raises a natural question: Does thinking more at test-time truly lead to better reasoning?<n>We show a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking"
arXiv Detail & Related papers (2025-06-04T17:55:09Z) - GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking [35.14983424309319]
We present GThinker, a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science.<n>GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets these cues to resolve inconsistencies.<n>To support the training, we construct GThinker-11K, comprising 7K high-quality, iteratively-annotated reasoning paths and 4K curated reinforcement learning samples.
arXiv Detail & Related papers (2025-06-01T16:28:26Z) - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning [10.255235456427037]
We propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in Large Language Models (LLMs)<n>The first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization.<n>The second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization.
arXiv Detail & Related papers (2025-05-27T13:29:51Z) - Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought.<n>This performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process.<n>We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z) - Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods [39.89239733570008]
This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models.<n>We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models.<n>For reasoning models, majority voting proves to be a robust inference strategy, generally competitive or outperforming other more sophisticated ITC methods.
arXiv Detail & Related papers (2025-04-18T19:32:55Z) - Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [86.79757571440082]
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks.<n>We identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts.<n>We propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts.
arXiv Detail & Related papers (2025-01-30T18:58:18Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training [49.3242278912771]
Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions.
Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework.
We propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process.
arXiv Detail & Related papers (2023-11-23T17:09:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.