CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
- URL: http://arxiv.org/abs/2502.21074v3
- Date: Tue, 23 Sep 2025 08:16:08 GMT
- Title: CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
- Authors: Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He,
- Abstract summary: We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space.<n>Experiments show CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale.
- Score: 30.762815456866083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale, achieving a 3.1x compression rate and outperforming the previous state-of-the-art by 28.2% in accuracy. CODI also demonstrates robustness, generalizable to complex datasets, and interpretability. These results validate that LLMs can reason effectively not only in natural language, but also in a latent continuous space. Code is available at https://github.com/zhenyi4/codi.
Related papers
- CoLT: Reasoning with Chain of Latent Tool Calls [31.228763375347608]
Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs)<n>We propose CoLT, a novel framework that implements latent reasoning as tool calls''
arXiv Detail & Related papers (2026-02-04T06:12:53Z) - ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model [34.90582960625524]
Long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs)<n>We propose ImgCoT that replaces the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images.<n>This substitutes linguistic bias with spatial inductive bias, enabling latent tokens to better capture global reasoning structure.
arXiv Detail & Related papers (2026-01-30T09:06:45Z) - SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens [43.78883511257627]
Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications.<n>We propose a novel semantically-aligned implicit CoT framework termed SemCoT.
arXiv Detail & Related papers (2025-10-28T20:11:54Z) - Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning [65.20602712957725]
Caco is a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data.<n>Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
arXiv Detail & Related papers (2025-10-05T07:59:24Z) - SIM-CoT: Supervised Implicit Chain-of-Thought [108.30049193668083]
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models.<n>We identify a core latent instability issue when scaling the computational budget of implicit CoT.<n>We propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
arXiv Detail & Related papers (2025-09-24T17:01:32Z) - ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model [1.0760366210656895]
ECCoT is a framework to evaluate and refine reasoning chains in Large Language Models.<n>It improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making.
arXiv Detail & Related papers (2025-06-24T13:09:53Z) - SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning [48.28847964704554]
Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference.<n>Recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance.<n>We introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths.
arXiv Detail & Related papers (2025-05-16T17:47:50Z) - CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers [45.59157559718677]
Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints.<n> Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models.<n>We propose CoT2Align, a universal KD framework that integrates Chain-of-Thought (CoT) augmentation and introduces Cross-CoT Alignment to enhance reasoning transfer.
arXiv Detail & Related papers (2025-02-24T03:30:29Z) - C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness [18.073777359647515]
Chain-of-Thought (CoT) before deriving the answer can improve the reasoning capabilities of large language models (LLMs)<n>However, the length of the generated CoT is much longer than the desired final answer, which results in additional decoding costs.<n>This paper presents a CoT compression framework that involves a compressor to compress an original longer CoT into a shorter CoT.
arXiv Detail & Related papers (2024-12-16T11:12:45Z) - Training Large Language Models to Reason in a Continuous Latent Space [71.0274000348354]
We introduce a new paradigm called Coconut (Chain of Continuous Thought) to explore the potential of reasoning beyond language.<n>Instead of decoding this state into words, we feed it back to the model as the next input embedding directly in the continuous space.<n>This latent reasoning paradigm enables an advanced reasoning pattern, where continuous thoughts can encode multiple alternative next steps.
arXiv Detail & Related papers (2024-12-09T18:55:56Z) - To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785]
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs)
We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
arXiv Detail & Related papers (2024-09-18T17:55:00Z) - Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding [14.175444025026508]
Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring chain-of-thought (CoT) prompting.
generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference.
We propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning.
arXiv Detail & Related papers (2024-09-13T06:29:20Z) - A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning [48.51969964676017]
Chain-of-Thought (CoT) holds a significant place in augmenting the reasoning performance for large language models.
We propose a Read-and-Control approach for controlling the accuracy of CoT.
arXiv Detail & Related papers (2024-06-18T04:07:13Z) - ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs)
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z) - Chain-of-Thought Reasoning Without Prompting [40.92854235219315]
CoT reasoning paths can be elicited from pre-trained language models by simply altering the textitdecoding process.
The presence of a CoT in the decoding path correlates with a higher confidence in the model's decoded answer.
arXiv Detail & Related papers (2024-02-15T18:55:41Z) - Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning
across Languages [46.496557448392494]
Chain-of-thought (CoT) is capable of eliciting models to explicitly generate reasoning paths.
Existing zero-shot prompting techniques are limited to a single language.
We introduce cross-lingual prompting (CLP), aiming to improve zero-shot CoT reasoning across languages.
arXiv Detail & Related papers (2023-10-23T10:56:03Z) - Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with
Large Language Models [68.05046964022844]
Large language models (LLMs) have unveiled remarkable reasoning capabilities by exploiting chain-of-thought (CoT) prompting.
We propose GeM-CoT, a Generalizable CoT prompting mechanism in Mixed-task scenarios where the type of input questions is unknown.
With this technical design, GeM-CoT simultaneously enjoys superior generalization capabilities and remarkable performances on 10 public reasoning tasks and 23 BBH tasks.
arXiv Detail & Related papers (2023-10-10T15:10:03Z) - CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$times$.
arXiv Detail & Related papers (2023-05-27T03:54:09Z) - Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z) - Towards Understanding Chain-of-Thought Prompting: An Empirical Study of
What Matters [82.84696222087396]
Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs)
We show that CoT reasoning is possible even with invalid demonstrations.
arXiv Detail & Related papers (2022-12-20T05:20:54Z) - CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z) - Learning Implicitly with Noisy Data in Linear Arithmetic [94.66549436482306]
We extend implicit learning in PAC-Semantics to handle intervals and threshold uncertainty in the language of linear arithmetic.
We show that our implicit approach to learning optimal linear programming objective constraints significantly outperforms an explicit approach in practice.
arXiv Detail & Related papers (2020-10-23T19:08:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.