Rethinking Thinking Tokens: LLMs as Improvement Operators
- URL: http://arxiv.org/abs/2510.01123v1
- Date: Wed, 01 Oct 2025 17:08:59 GMT
- Title: Rethinking Thinking Tokens: LLMs as Improvement Operators
- Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal,
- Abstract summary: Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to provide other combinations on this Pareto frontier? We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round.
- Score: 80.12087211785949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via the degree of parallelism and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR), which iteratively improves a single candidate answer and outperforms long CoT. The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
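The PDR loop is simple enough to sketch directly. Below is a minimal, hypothetical rendering in Python: `generate` stands in for any LLM completion call (it is not an API from the paper), and the prompts, round count, and workspace budget are illustrative assumptions rather than the authors' actual configuration.

```python
# Minimal sketch of the Parallel-Distill-Refine (PDR) loop described above.
# Hypothetical: `generate` is a placeholder for any LLM completion call, and
# the prompts and budgets are invented for illustration, not the paper's.

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for an LLM completion call (any provider)."""
    raise NotImplementedError

def pdr(problem: str, n_parallel: int = 4, n_rounds: int = 2) -> str:
    workspace = ""  # bounded textual summary carried between rounds
    answer = ""
    for _ in range(n_rounds):
        # (i) diverse drafts (independent samples; issued concurrently in practice)
        drafts = [
            generate(f"Problem: {problem}\nNotes: {workspace}\nSolve step by step.")
            for _ in range(n_parallel)
        ]
        # (ii) distill the drafts into a bounded, textual workspace
        workspace = generate(
            "Summarize the key ideas, partial results, and errors in these "
            "attempts in under 300 words:\n" + "\n---\n".join(drafts),
            temperature=0.0,
        )
        # (iii) refine conditioned on the workspace; the output seeds the next round
        answer = generate(
            f"Problem: {problem}\nDistilled notes: {workspace}\n"
            "Write a single improved solution.",
            temperature=0.0,
        )
    return answer

# n_parallel=1 collapses PDR to Sequential Refinement (SR): iteratively
# improving a single candidate answer.
```

The point the abstract stresses falls out of the structure: each round's context is bounded by the workspace size and the number of parallel drafts, not by the running total of generated tokens, so accuracy and context length can be traded off independently.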
Related papers
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. We find that step-level recombination is most beneficial on harder problems. Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
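The summary only hints at the mechanism, but "a reusable pool of step-level candidates" plus reward-guided stitching suggests something like the greedy recombination below. This is an illustrative reading, not the paper's algorithm; the `reward` callable and the step-pooling scheme are assumptions.

```python
# Illustrative only: pool step-level candidates from several sampled traces,
# then greedily stitch a new trace using a reward signal. The actual
# splitting, scoring, and recombination details are not given in the summary.

def stitch(traces: list[list[str]], reward) -> list[str]:
    depth = max(len(t) for t in traces)
    # step_pool[i] collects every candidate for step i across all traces
    step_pool = [[t[i] for t in traces if i < len(t)] for i in range(depth)]
    stitched: list[str] = []
    for candidates in step_pool:
        # pick the candidate the reward function scores highest in context
        stitched.append(max(candidates, key=lambda s: reward(stitched, s)))
    return stitched
```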
arXiv Detail & Related papers (2026-02-26T11:08:39Z)
- Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs [16.81068280262534]
Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? We prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability.
arXiv Detail & Related papers (2026-02-02T23:33:34Z)
- Divide-and-Conquer CoT: RL for Reducing Latency via Parallel Reasoning [18.5812457692667]
We propose to train Divide-and-Conquer CoT (DC-CoT) to reduce latency. DC-CoT acts as a director that identifies distinct subtasks that can be performed in parallel within its reasoning process, and then spawns workers to execute those subtasks. Our goal is to achieve high accuracy with a low longest-path length, a theoretical measure of the latency needed for the response.
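As a hedged sketch of the director/worker pattern just described (reusing the placeholder `generate` call from the PDR sketch above; the real DC-CoT is trained with RL and its prompting surely differs):

```python
from concurrent.futures import ThreadPoolExecutor

def dc_cot(problem: str) -> str:
    # Director: propose independent subtasks, one per line (illustrative prompt)
    plan = generate(f"List independent subtasks, one per line, for: {problem}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    # Workers: execute subtasks concurrently; latency is set by the slowest
    # worker (the longest path), not by the sum of all subtask lengths.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: generate(f"Solve: {t}"), subtasks))
    notes = "\n".join(f"- {t}: {r}" for t, r in zip(subtasks, results))
    return generate(f"Combine these partial results into one answer:\n{notes}")
```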
arXiv Detail & Related papers (2026-01-30T14:37:07Z)
- DeepPrune: Parallel Scaling without Inter-trace Redundancy [53.62015294143274]
Over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. We propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our work establishes a new standard for efficient parallel reasoning.
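The headline statistic is easy to make concrete. The snippet below only measures the post-hoc redundancy that DeepPrune targets; the framework itself prunes dynamically during decoding, which this does not attempt to show.

```python
from collections import Counter

def redundancy(final_answers: list[str]) -> float:
    """Fraction of parallel traces whose final answer duplicates another's."""
    counts = Counter(a.strip().lower() for a in final_answers)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(final_answers)

# e.g. redundancy(["42", "42", "42", "17", "42"]) == 0.6: three of the five
# traces repeat an answer already produced, so their compute was wasted.
```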
arXiv Detail & Related papers (2025-10-09T17:24:54Z)
- Parallel-R1: Towards Parallel Thinking via Reinforcement Learning [65.68667585027232]
Parallel thinking is a novel approach for enhancing the reasoning capabilities of large language models. We propose Parallel-R1, the first reinforcement learning framework that enables parallel thinking behaviors. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking.
arXiv Detail & Related papers (2025-09-09T17:59:35Z)
- Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
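One way to read "interpolating between full CoT and solution-only sampling" is to branch final answers from reasoning prefixes of varying depth, as in the hedged sketch below (again reusing the placeholder `generate`; the fractions and prompts are invented for illustration and may not match the paper's scheme).

```python
def fractured_samples(problem: str, fractions=(0.0, 0.5, 1.0), branches: int = 2):
    # Sample one full reasoning trace, then answer from prefixes of it:
    # fraction 0.0 is solution-only sampling, 1.0 conditions on the full CoT.
    steps = generate(f"Reason step by step about: {problem}").splitlines()
    answers = []
    for frac in fractions:
        prefix = "\n".join(steps[: round(frac * len(steps))])
        for _ in range(branches):
            answers.append(generate(
                f"Problem: {problem}\nPartial reasoning:\n{prefix}\nFinal answer:"
            ))
    return answers  # evaluate Pass@k at a fixed total token budget
```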
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
- Training Chain-of-Thought via Latent-Variable Inference [30.21067593018967]
Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" prompt.
Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers.
We propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting.
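Written out, the objective treats the rationale as a latent variable. A standard rendering (with question x, rationale z, and answer y; the notation is mine, not necessarily the paper's):

```latex
% Rationale z is latent; fine-tuning maximizes the marginal log-likelihood
% of the correct answer y given the question x:
\max_{\theta} \; \log p_{\theta}(y \mid x)
  \;=\; \log \sum_{z} p_{\theta}(z \mid x)\, p_{\theta}(y \mid x, z)
```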
arXiv Detail & Related papers (2023-11-28T17:47:32Z)
- Bridging RL Theory and Practice with the Effective Horizon [18.706109961534676]
We show that prior bounds do not correlate well with whether deep RL succeeds or fails.
We generalize this into a new complexity measure of an MDP that we call the effective horizon.
We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.
arXiv Detail & Related papers (2023-04-19T17:59:01Z)
- On Effective Parallelization of Monte Carlo Tree Search [51.15940034629022]
Monte Carlo Tree Search (MCTS) is computationally expensive as it requires a substantial number of rollouts to construct the search tree.
How to design effective parallel MCTS algorithms has not been systematically studied and remains poorly understood.
We demonstrate how proposed necessary conditions can be adopted to design more effective parallel MCTS algorithms.
arXiv Detail & Related papers (2020-06-15T21:36:00Z)