Divide-and-Conquer CoT: RL for Reducing Latency via Parallel Reasoning
- URL: http://arxiv.org/abs/2601.23027v1
- Date: Fri, 30 Jan 2026 14:37:07 GMT
- Title: Divide-and-Conquer CoT: RL for Reducing Latency via Parallel Reasoning
- Authors: Arvind Mahankali, Kaiyue Wen, Tengyu Ma
- Abstract summary: We propose to train Divide-and-Conquer CoT (DC-CoT) to reduce the latency. DC-CoT can act as a director that identifies distinct subtasks that can be performed in parallel in its reasoning process, and then spawns workers to execute the subtasks. Our goal is to achieve high accuracy, with a low longest path length, which is a theoretical measure of the latency needed for the response.
- Score: 18.5812457692667
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Long chain-of-thought reasoning (Long CoT) is now fundamental to state-of-the-art LLMs, especially in mathematical reasoning. However, LLM generation is highly sequential, and long CoTs lead to a high latency. We propose to train Divide-and-Conquer CoT (DC-CoT) to reduce the latency. With DC-CoT, the model can act as a director that identifies distinct subtasks that can be performed in parallel in its reasoning process, and then spawns workers to execute the subtasks. Our goal is to achieve high accuracy, with a low longest path length, which is a theoretical measure of the latency needed for the response. We start with a long CoT base model (DeepScaleR-1.5B-Preview), and first use SFT with a small curated demonstration set to initialize its ability to spawn workers in a certain format. Because SFT degrades the accuracy significantly, we design a multi-stage RL algorithm, with various data filtering strategies, to recover the accuracy while decreasing the longest path length. Across several benchmarks including AIME 2024 and HMMT 2025, DC-CoT achieves similar accuracy as DeepScaleR-1.5B-Preview while decreasing longest path length by 35-40%. Our code, SFT dataset and models are publicly available at https://github.com/amahankali10/DC_CoT_RL_for_Low_Latency_CoT_with_Parallel_Reasoning.
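The longest-path-length metric described in the abstract can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' code: the function names and trace representation are assumptions. The idea is that a DC-CoT trace consists of the director's sequential segments plus, at each spawn point, a group of workers that run in parallel, so only the slowest worker in each group contributes to latency.

```python
# Hypothetical sketch of the "longest path length" latency metric: a director
# generates some tokens, spawns workers that run in parallel, then continues
# after the slowest worker finishes. Names and trace format are illustrative.

def longest_path_length(director_segments, worker_groups):
    """Compute the longest sequential token path through a DC-CoT trace.

    director_segments: token counts of the director's sequential segments,
        e.g. [pre-spawn tokens, ..., final combining tokens].
    worker_groups: one list of worker token counts per spawn point; workers
        within a group run in parallel, so only the slowest one counts.
    """
    total = sum(director_segments)
    # Each parallel group contributes only its longest worker.
    total += sum(max(group) for group in worker_groups)
    return total

# A director writes 100 tokens, spawns three parallel workers (400, 250,
# 300 tokens), then writes 50 more tokens to combine their results.
sequential = 100 + 400 + 250 + 300 + 50                        # 1100 tokens
parallel = longest_path_length([100, 50], [[400, 250, 300]])   # 550 tokens
```

Under this toy accounting, the parallel trace emits the same total number of tokens but its longest sequential path is half as long, which is the quantity the paper's RL training aims to shrink while preserving accuracy.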
Related papers
- KLong: Training LLM Agent for Extremely Long-horizon Tasks [58.68395081637727]
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. We first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench.
arXiv Detail & Related papers (2026-02-19T17:01:08Z) - AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding [35.10915929939651]
Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT). KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components.
arXiv Detail & Related papers (2025-10-08T19:36:11Z) - Rethinking Thinking Tokens: LLMs as Improvement Operators [80.12087211785949]
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to provide other combinations on this Pareto frontier? We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace.
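The three PDR steps summarized above can be sketched as a single inference round. This is a hypothetical illustration of the loop's structure, not the paper's implementation: the `generate`, `distill`, and `refine` callables stand in for actual model calls.

```python
# Hypothetical sketch of one Parallel-Distill-Refine (PDR) round. The three
# callables are placeholders for model calls, not a real API.

def pdr_round(problem, num_drafts, generate, distill, refine):
    # (i) generate diverse drafts in parallel (independent calls, so they
    # could be issued concurrently; shown sequentially for simplicity)
    drafts = [generate(problem) for _ in range(num_drafts)]
    # (ii) distill the drafts into a bounded, textual workspace
    workspace = distill(drafts)
    # (iii) refine conditioned on the workspace
    return refine(problem, workspace)

# Toy stand-ins to show the data flow.
generate = lambda p: f"draft:{p}"
distill = lambda ds: " | ".join(ds)
refine = lambda p, w: f"answer({w})"
result = pdr_round("x", 2, generate, distill, refine)
```

Because the workspace is bounded, the context passed to the refine step stays small regardless of how many drafts were generated, which is what keeps token cost and latency in check.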
arXiv Detail & Related papers (2025-10-01T17:08:59Z) - dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency. Existing open-source models still require nearly token-length decoding steps to ensure performance. We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z) - ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time [4.737679362712655]
ETT (Extend at Test-Time) is a method for extending the context length of short-context Transformer-based Language Models. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens.
arXiv Detail & Related papers (2025-07-08T18:06:45Z) - R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search [61.4807238517108]
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving. CoT's extension to Long-CoT introduces substantial computational overhead due to increased token length. We propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence.
arXiv Detail & Related papers (2025-05-22T16:06:59Z) - Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers [57.95157497749428]
We propose RL$^V$, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier. RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8$-$32\times$ more efficient test-time compute scaling.
arXiv Detail & Related papers (2025-05-07T22:41:26Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper argues that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding [28.76164449548306]
Multi-Draft Speculative Decoding (MDSD) offers a promising solution by using a smaller draft model to generate multiple token sequences.
We present SpecHub, a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead.
arXiv Detail & Related papers (2024-11-08T02:47:07Z) - Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window.
We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
arXiv Detail & Related papers (2024-03-29T15:07:21Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse fine-tuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.