Parallel Continuous Chain-of-Thought with Jacobi Iteration
- URL: http://arxiv.org/abs/2506.18582v1
- Date: Mon, 23 Jun 2025 12:35:41 GMT
- Title: Parallel Continuous Chain-of-Thought with Jacobi Iteration
- Authors: Haoyi Wu, Zhihao Teng, Kewei Tu
- Abstract summary: Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. We propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially.
- Score: 39.36822246659272
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens spoil parallel training, leading to long training time. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially and thus improving both training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at https://github.com/whyNLP/PCCoT.
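The core idea, replacing a strictly sequential chain of latent thoughts with parallel Jacobi sweeps, can be sketched in a few lines. The sketch below is illustrative only and not the paper's implementation: `f`, `W`, and the dimensions are made-up stand-ins for a transformer's latent update. It shows the key property that makes the trade-off work: each Jacobi sweep refreshes all T thought positions in parallel from the previous sweep's values, and after at most T sweeps the parallel trajectory coincides with the sequential one, so fewer-than-T sweeps trade exactness for wall-clock time.

```python
import numpy as np

def sequential_thoughts(x0, f, T):
    """Baseline: compute T latent thoughts one after another (T sequential steps)."""
    z = [x0]
    for _ in range(T):
        z.append(f(z[-1]))
    return np.stack(z[1:])

def jacobi_thoughts(x0, f, T, n_iters):
    """Jacobi-style update: refresh all T thought positions in parallel,
    each reading its predecessor's value from the previous sweep."""
    z = np.tile(x0, (T, 1))                     # initialize every position from the start state
    for _ in range(n_iters):
        prev = np.vstack([x0[None, :], z[:-1]]) # predecessor values from the last sweep
        z = f(prev)                             # one parallel pass over all T positions
    return z

# toy contraction standing in for the model's latent-thought update
W = np.array([[0.4, 0.1], [0.0, 0.3]])
f = lambda h: np.tanh(h @ W.T + 0.5)

x0 = np.zeros(2)
T = 6
exact = sequential_thoughts(x0, f, T)
approx = jacobi_thoughts(x0, f, T, n_iters=T)
# each sweep pins down one more position, so T sweeps reproduce the chain exactly
print(np.max(np.abs(exact - approx)))
```

With `n_iters < T` the early positions are still exact while later ones are approximations, which is the knob the paper tunes to cut training and inference time.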
Related papers
- Latent Reasoning with Supervised Thinking States [60.09942890192309]
Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs. We propose Thinking States, a method that performs reasoning while the input is being processed. We show Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than seen during training.
arXiv Detail & Related papers (2026-02-09T07:12:41Z) - Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge [87.51901436392427]
Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT). Humans, by contrast, often reason softly by maintaining a tractable probability distribution over plausible next steps. We propose Multiplex Thinking, a soft reasoning mechanism that samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT.
arXiv Detail & Related papers (2026-01-13T18:48:00Z) - ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models [99.6720868215076]
We introduce ThreadWeaver, a framework for adaptive parallel reasoning. ThreadWeaver achieves accuracy on par with popular sequential reasoning models of comparable size. We show that ThreadWeaver delivers up to 1.53x average speedup in token latency.
arXiv Detail & Related papers (2025-11-24T18:55:59Z) - CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling [11.252930904797]
We propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS). CoPRIS mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. Experiments show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems.
arXiv Detail & Related papers (2025-11-05T11:39:32Z) - Rethinking Thinking Tokens: LLMs as Improvement Operators [80.12087211785949]
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to provide other combinations on this Pareto frontier? We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace.
arXiv Detail & Related papers (2025-10-01T17:08:59Z) - SIM-CoT: Supervised Implicit Chain-of-Thought [108.30049193668083]
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models. We identify a core latent instability issue when scaling the computational budget of implicit CoT. We propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
arXiv Detail & Related papers (2025-09-24T17:01:32Z) - Soft Tokens, Hard Truths [17.640897774014707]
This work introduces a scalable method to learn continuous CoTs via reinforcement learning (RL). We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32.
arXiv Detail & Related papers (2025-09-23T15:43:47Z) - Continuous Chain of Thought Enables Parallel Exploration and Reasoning [38.59659461841282]
Current language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. Our work examines the benefits of continuously-valued tokens (CoT2) through logical reasoning tasks. We show that CoT2 allows the model to track multiple traces in parallel and quantify its benefits for inference efficiency.
arXiv Detail & Related papers (2025-05-29T16:58:28Z) - To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers [32.01426831450348]
Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks. We provide a formal analysis of their respective strengths and limitations.
arXiv Detail & Related papers (2025-05-25T17:49:37Z) - Chain-of-Thought Tokens are Computer Program Variables [24.55270838267279]
Chain-of-thought (CoT) requires large language models to generate intermediate steps before reaching the final answer. We study the role of CoT tokens in large language models on two compositional tasks. We find that preserving only the tokens that store intermediate results achieves comparable performance.
arXiv Detail & Related papers (2025-05-08T05:32:36Z) - Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding [11.07450742824775]
Speculative decoding aims to accelerate the auto-regressive token generation process of a target Large Language Model. Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. We propose Gumiho, a hybrid model combining serial and parallel heads.
arXiv Detail & Related papers (2025-03-13T07:55:38Z) - LLM Pretraining with Continuous Concepts [71.98047075145249]
Next token prediction has been the standard training objective used in large language model pretraining. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts.
arXiv Detail & Related papers (2025-02-12T16:00:11Z) - SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [65.62084602011596]
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. We have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation) contribute disproportionately to attention scores compared to semantically meaningful tokens. We introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
arXiv Detail & Related papers (2024-12-16T18:58:57Z) - RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full-context attention and attention over a subset of input tokens during generation. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance on various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z) - Nash CoT: Multi-Path Inference with Preference Equilibrium [40.50811042423615]
Chain of thought (CoT) is a reasoning framework that can enhance the performance of Large Language Models (LLMs) on complex inference tasks. However, there is no known optimal setting for the number of inference paths, so obtaining better results inflates inference cost. We propose Nash CoT, constructing a game system on each path that balances generation from role-specific LLMs against that of general LLMs. We evaluate Nash CoT across various inference tasks, including Arabic Reasoning, Commonsense Question Answering, and Symbolic Inference.
arXiv Detail & Related papers (2024-06-18T07:46:13Z) - Randomized Block-Diagonal Preconditioning for Parallel Learning [0.0]
We study preconditioned gradient-based optimization methods where the preconditioning matrix has block-diagonal form.
Our main contribution is to demonstrate that the convergence of these methods can significantly be improved by a randomization technique.
arXiv Detail & Related papers (2020-06-24T10:12:36Z) - Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
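The fixed-point framing above can be made concrete with a toy depth-L network. The sketch below is illustrative, not the paper's implementation (the weights, activation, and dimensions are made up): it treats the layer recurrence h_i = f_i(h_{i-1}) as a nonlinear system and solves it with both a Jacobi sweep (all layers refreshed in parallel from the previous sweep) and a Gauss-Seidel sweep (layers refreshed in order, reusing values already updated this sweep), checking that both reproduce the ordinary sequential pass exactly.

```python
import numpy as np

# Treat a depth-L feedforward pass h_i = f_i(h_{i-1}) as the nonlinear system
# h_i - f_i(h_{i-1}) = 0 and solve it by fixed-point sweeps.
rng = np.random.default_rng(0)
L, d = 5, 3
Ws = rng.normal(scale=0.3, size=(L, d, d))   # one toy weight matrix per layer
f = lambda i, h: np.tanh(Ws[i] @ h)

h0 = rng.normal(size=d)

# Ground truth: ordinary sequential evaluation, one layer after another.
truth = [h0]
for i in range(L):
    truth.append(f(i, truth[-1]))
truth = np.stack(truth[1:])

# Jacobi: every layer output is updated from the *previous* sweep's values,
# so each sweep is fully parallelizable; L sweeps suffice for exactness.
h = np.zeros((L, d))
for _ in range(L):
    prev = np.vstack([h0[None, :], h[:-1]])
    h = np.stack([f(i, prev[i]) for i in range(L)])

# Gauss-Seidel: update layers in order, reusing values refreshed this sweep;
# a single sweep already coincides with the sequential pass.
g = np.zeros((L, d))
g[0] = f(0, h0)
for i in range(1, L):
    g[i] = f(i, g[i - 1])

assert np.allclose(h, truth) and np.allclose(g, truth)
```

Jacobi needs more sweeps than Gauss-Seidel in the worst case but each sweep runs in parallel, which is the trade-off the hybrid methods mentioned above navigate.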
arXiv Detail & Related papers (2020-02-10T10:11:31Z) - Non-Autoregressive Machine Translation with Disentangled Context Transformer [70.95181466892795]
State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens.
We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts.
Our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
arXiv Detail & Related papers (2020-01-15T05:32:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.