PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning
- URL: http://arxiv.org/abs/2602.11639v1
- Date: Thu, 12 Feb 2026 06:43:08 GMT
- Title: PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning
- Authors: Ruixiang Feng, Yuntao Wen, Silin Zhou, Ke Shi, Yifan Wang, Ran Le, Zhenwei An, Zongchao Chen, Chen Yang, Guangyue Peng, Yiming Jia, Dongsheng Wang, Tao Zhang, Lisi Chen, Yang Song, Shen Gao, Shuo Shang,
- Abstract summary: Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from overthinking''<n>We propose textbfmodel, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision.
- Score: 37.125266434955584
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from ``overthinking'', producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To solve these limitations, we propose \textbf{\model}, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that \model achieves a substantial reduction in token usage (up to \textbf{55.7\%}) while simultaneously improving accuracy (up to \textbf{4.1\%}) on math benchmarks, with generalization ability to code, science, and general domains.
Related papers
- Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy.<n>We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z) - Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning [39.72119774004103]
Chain-of-Thought (CoT) has substantially empowered Large Language Models (LLMs) to tackle complex reasoning tasks.<n>The verbose nature of explicit reasoning steps incurs prohibitive inference latency and computational costs, limiting real-world deployment.<n>We propose Compress responses for Easy questions and Explore Hard ones (CEEH), a difficulty-aware approach to RL-based efficient reasoning.
arXiv Detail & Related papers (2026-02-26T05:47:30Z) - Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression [55.63153956934198]
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs)<n>Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios.<n>We propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy.
arXiv Detail & Related papers (2026-02-09T06:57:15Z) - Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization [68.89915707647138]
Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains.<n>We propose textbfCoSMo (textbfSplit-textbfMerge textbfOptimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume.
arXiv Detail & Related papers (2026-02-03T05:54:28Z) - TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data.<n>We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z) - Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models [26.88030285500965]
Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking"<n>We introduce textbfStep Pruner (SP), an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps.<n>Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning.
arXiv Detail & Related papers (2025-10-04T13:24:26Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks.<n>Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors.<n>We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal [13.035073453917088]
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT)<n>We propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression.<n> ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs.
arXiv Detail & Related papers (2025-08-08T03:46:21Z) - SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control [5.224609066309358]
Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling.<n>Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning.<n>We propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains.
arXiv Detail & Related papers (2025-07-06T11:21:47Z) - ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation [74.37307916314407]
We propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely.<n>Experiments on the state-of-the-art LRMs, including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning.
arXiv Detail & Related papers (2025-06-23T16:20:44Z) - ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy [8.962703809086628]
ThinkLess is an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model.<n>We show that ThinkLess achieves comparable accuracy to full-length Chain-of-Thought (CoT) decoding while greatly reducing decoding time and memory consumption.
arXiv Detail & Related papers (2025-05-21T15:58:16Z) - DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [30.184895117009457]
This paper introduces Difficulty-Adaptive Slow Thinking (DAST), a novel framework that enables models to autonomously adjust the length of Chain-of-Thought (CoT) based on problem difficulty.<n>Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking while preserving reasoning accuracy on complex problems.
arXiv Detail & Related papers (2025-03-06T14:23:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.