Fugu-MT Paper Translation (Abstract): Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

Paper summary: Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

arxiv url: http://arxiv.org/abs/2605.07316v1
Date: Fri, 08 May 2026 06:25:08 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.854697
Title: Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
Title (reference translation): Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
Authors: Chen Wang, Hexuan Deng, Yining Zhang, Yuchen Zhang, Jionghao Bai, Zhaochun Li, Ge Lan, Yue Wang,
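Abstract summary: This paper proposes an on-policy regularization method that obtains its compression signal from a virtual shorter distribution induced by the shortest correct responses in rollout groups. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy.
Reference score (independently computed attention score): 11.132427208920424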
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the training dynamics of existing compression methods. We observe that the length-accuracy correlation is initially negative but continually increases during compression, indicating that shorter responses are initially more likely to be correct but gradually lose this property as the policy moves toward underthinking. Based on this observation, we formalize overthinking: a negative correlation indicates an overthinking regime, while a positive one indicates underthinking. When overthinking, the shortest correct responses are shorter than the group-average response length in expectation, making them natural compression targets already present in on-policy rollouts. We therefore propose Implicit Compression Regularization (ICR), an on-policy regularization method whose compression signal comes from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length-accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy-length Pareto frontier.
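Abstract (reference translation): Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking. Existing methods rely mainly on length penalties or early-exit strategies; the former can reduce accuracy and induce underthinking, while the latter assumes that substantial portions of a reasoning trace can be safely cut. To obtain a compression signal without these limitations, the paper revisits the training dynamics of existing compression methods. The length-accuracy correlation is initially negative but rises continually during compression, which suggests that shorter responses are at first more likely to be correct but gradually lose this property as the policy moves toward underthinking. A negative correlation indicates an overthinking regime, while a positive correlation indicates underthinking. Under overthinking, the shortest correct responses are shorter in expectation than the group-average response length, so they are natural compression targets already present in on-policy rollouts. The paper therefore proposes Implicit Compression Regularization (ICR), an on-policy regularization method whose compression signal is derived from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length-accuracy correlation during compression, with short responses staying aligned with correctness rather than drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy-length Pareto frontier.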
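
The diagnostic described in the abstract can be illustrated with a small sketch. It computes the correlation between response length and correctness within one rollout group, labels the regime by the sign of that correlation, and picks the shortest correct response as a candidate compression target. This is not the ICR objective itself, which the abstract does not specify; the function names, the use of Pearson correlation, and the toy rollout group are all assumptions made for illustration.

```python
from math import sqrt
from typing import Optional, Sequence


def length_accuracy_correlation(
    lengths: Sequence[int], correct: Sequence[bool]
) -> Optional[float]:
    """Pearson correlation between response length and correctness (0/1).

    Returns None when the correlation is undefined, i.e. when all lengths
    are equal or all responses share the same correctness label.
    """
    n = len(lengths)
    if n != len(correct) or n < 2:
        return None
    xs = [float(v) for v in lengths]
    ys = [1.0 if c else 0.0 for c in correct]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return None
    return cov / sqrt(var_x * var_y)


def regime(corr: Optional[float]) -> str:
    """Label the regime by the sign of the correlation, as in the abstract."""
    if corr is None or corr == 0.0:
        return "undetermined"
    return "overthinking" if corr < 0.0 else "underthinking"


def shortest_correct_index(
    lengths: Sequence[int], correct: Sequence[bool]
) -> Optional[int]:
    """Index of the shortest correct response in the group, or None."""
    candidates = [i for i, c in enumerate(correct) if c]
    if not candidates:
        return None
    return min(candidates, key=lambda i: lengths[i])


# Hypothetical rollout group: token lengths and verifier outcomes.
group_lengths = [120, 300, 450, 800]
group_correct = [True, True, False, False]

corr = length_accuracy_correlation(group_lengths, group_correct)
target = shortest_correct_index(group_lengths, group_correct)
print(regime(corr), target)
```

In this toy group the short responses are the correct ones, so the correlation is negative and the group is labelled as overthinking; the shortest correct response (120 tokens) is also shorter than the group mean (417.5 tokens), matching the property the abstract relies on.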

Related papers