Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
- URL: http://arxiv.org/abs/2602.08324v1
- Date: Mon, 09 Feb 2026 06:57:15 GMT
- Title: Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
- Authors: Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li, Jie Hu, Xinghao Chen, Rongrong Ji, Shaohui Lin
- Abstract summary: Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs). Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios. We propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy.
- Score: 55.63153956934198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets through a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods.
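To make the "token reduction" figure in the abstract concrete, here is a minimal sketch of how such a percentage is commonly computed from token counts. The function names and the example counts (1000 and 270 tokens) are illustrative assumptions, not values or code from the paper; the paper's exact measurement protocol is not given in this listing.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed relative to the uncompressed reasoning trace."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if compressed_tokens < 0:
        raise ValueError("compressed_tokens must be non-negative")
    return 1.0 - compressed_tokens / original_tokens


def budget_for_ratio(original_tokens: int, keep_ratio: float) -> int:
    """Token budget left when only `keep_ratio` of the original trace is kept."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always allow at least one token so the budget is never empty.
    return max(1, round(original_tokens * keep_ratio))


# Hypothetical example: a 1000-token trace compressed to 270 tokens.
reduction = token_reduction(1000, 270)
print(f"token reduction: {reduction:.0%}")
print("budget at keep ratio 0.27:", budget_for_ratio(1000, 0.27))
```

Under these assumed counts, keeping 270 of 1000 tokens corresponds to a 73% reduction, the same order as the MATH-500 figure quoted above.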
Related papers
- CtrlCoT: Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning [29.057579417751203]
Chain-of-thought (CoT) prompting improves LLM reasoning but incurs high latency and memory cost due to verbose traces. We propose CtrlCoT, a dual-granularity CoT compression framework that harmonizes semantic abstraction and token-level pruning.
arXiv Detail & Related papers (2026-01-28T10:38:49Z)
- ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning [46.481679150652205]
Large Reasoning Models generate redundant reasoning paths that inflate computational costs without improving accuracy. In this paper, we introduce ConMax, a novel reinforcement learning framework designed to automatically compress reasoning traces. Experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off.
arXiv Detail & Related papers (2026-01-08T14:22:58Z)
- Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression [29.354544133745453]
Upfront CoT (UCoT) is an efficient reasoning framework with upfront thought embedding to automate Chain-of-Thought (CoT) compression. UCoT maintains the powerful reasoning ability of the executor while significantly reducing the length of the CoT.
arXiv Detail & Related papers (2025-10-09T06:34:31Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains [24.805434364781306]
We introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space. CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios. Our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%.
arXiv Detail & Related papers (2025-05-22T11:40:26Z)
- Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
- ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs. Existing fine-tuning-based compression methods either apply post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection. We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)
- Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding [14.175444025026508]
Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring chain-of-thought (CoT) prompting. Generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference. We propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning.
arXiv Detail & Related papers (2024-09-13T06:29:20Z)
- Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees [53.950234267704]
We introduce Global-QSGD, an All-reduce gradient-compatible quantization method. We show that it accelerates distributed training by up to 3.51% over baseline quantization methods.
arXiv Detail & Related papers (2023-05-29T21:32:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.