Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization
- URL: http://arxiv.org/abs/2602.10048v1
- Date: Tue, 10 Feb 2026 18:15:58 GMT
- Title: Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization
- Authors: Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin
- Abstract summary: Large Language Models (LLMs) generate unnecessarily verbose Chain-of-Thought (CoT) reasoning. We propose Fine-grained Group policy Optimization (FGO). FGO refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression.
- Score: 6.221775342067641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose Fine-grained Group policy Optimization (FGO), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO addresses two major limitations of GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.
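For context on the GRPO baseline that the abstract builds on, the sketch below shows the commonly described group-relative advantage: each response's reward is standardized against the other responses sampled for the same prompt. This is not the FGO method; the abstract does not specify how FGO subdivides groups or weights them by length and entropy, so none of that is shown. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (GRPO-style baseline).

    Hypothetical helper: `eps` and the population standard deviation
    are illustrative choices, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group whose responses all receive the same reward.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In this sketch, correct responses receive positive advantages and incorrect ones negative advantages, while a group with identical rewards yields all-zero advantages and therefore contributes no learning signal under this baseline.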
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning [67.45237332694025]
Group Relative Policy Optimization is effective for training language models on complex reasoning. We propose Weakly Supervised GRPO, which improves rollout efficiency by converting terminal rewards into correctness-aware guidance.
arXiv Detail & Related papers (2026-02-19T02:43:35Z)
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression [55.63153956934198]
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs). Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios. We propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy.
arXiv Detail & Related papers (2026-02-09T06:57:15Z)
- Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization [68.89915707647138]
Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains. We propose CoSMo (Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume.
arXiv Detail & Related papers (2026-02-03T05:54:28Z)
- IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning [11.499402258204375]
Intergroup Relative Preference Optimization (IRPO) is a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs.
arXiv Detail & Related papers (2026-01-02T12:57:06Z)
- GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA [6.07907277934348]
GIFT is a novel reinforcement learning framework for alignment. It minimizes the discrepancy between implicit and explicit reward models. It achieves superior reasoning and alignment performance on mathematical benchmarks.
arXiv Detail & Related papers (2025-10-27T21:18:19Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward [10.640867597958863]
We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefixes via a Shared-Prefix Forward strategy. By restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO.
arXiv Detail & Related papers (2025-06-05T09:13:37Z)
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z)
- Balancing LoRA Performance and Efficiency with Simple Shard Sharing [8.827921242078883]
Optimal Shard Sharing Integration in LoRA, a novel PEFT approach that addresses this trade-off through a simple shard-sharing mechanism. Fossils significantly outperforms standard LoRA and its prominent variants in both model performance metrics and computational efficiency.
arXiv Detail & Related papers (2024-09-19T10:26:42Z)