Fugu-MT Paper Translation (Summary): SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Paper overview: SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

arxiv url: http://arxiv.org/abs/2509.25176v1
Date: Mon, 29 Sep 2025 17:59:08 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.153324
Title: SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression
Title (reference translation): SIRI: Scaling Iterative Reinforcement Learning via Interleaved Compression
Authors: Haoming Wen, Yushi Bai, Juanzi Li, Jie Tang
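Abstract summary: Introduces SIRI (Scaling Iterative Reinforcement Learning with Interleaved Compression), a simple yet effective RL approach for Large Reasoning Models (LRMs). Shows that the trade-off between reducing repetitive reasoning and preserving performance can be overcome by a training regime that iteratively alternates between compressing and expanding the reasoning budget. Also finds that after each compression-expansion cycle, the model's performance improves even as its output length decreases.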
Reference score (attention score computed in-house): 48.04180854972225
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce SIRI, Scaling Iterative Reinforcement Learning with Interleaved Compression, a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget, by dynamically adjusting the maximum rollout length during training. The compression phase cuts the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, providing space for the model to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model's performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier in the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy compared to all other methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM's output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal "sweet spot" between the two. Our models are publicly available.
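The abstract describes the training regime only at a high level: the maximum rollout length is adjusted during training so that a compression phase (short limit) alternates with an expansion phase (relaxed limit). The following is a minimal Python sketch of such a length schedule, assuming a simple square-wave shape. The names `InterleavedLengthSchedule` and `truncate_rollout`, the parameter names, and all numeric values are hypothetical illustrations, not the paper's code or hyperparameters; the abstract also does not say how truncated rollouts are rewarded, so the sketch leaves that to the caller.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InterleavedLengthSchedule:
    """Hypothetical square-wave schedule for the maximum rollout length.

    One cycle = a compression phase (short limit) followed by an
    expansion phase (relaxed limit). Names and values are illustrative
    assumptions, not SIRI's actual implementation or hyperparameters.
    """

    compress_len: int    # max rollout tokens while compressing (assumed)
    expand_len: int      # max rollout tokens while expanding (assumed)
    compress_steps: int  # training steps per compression phase (assumed)
    expand_steps: int    # training steps per expansion phase (assumed)

    def __post_init__(self) -> None:
        # The compression limit must be strictly shorter than the expansion limit.
        if not 0 < self.compress_len < self.expand_len:
            raise ValueError("need 0 < compress_len < expand_len")
        if self.compress_steps <= 0 or self.expand_steps <= 0:
            raise ValueError("phase lengths must be positive")

    def phase(self, step: int) -> str:
        """Return 'compress' or 'expand' for a given training step."""
        cycle = self.compress_steps + self.expand_steps
        return "compress" if step % cycle < self.compress_steps else "expand"

    def max_len(self, step: int) -> int:
        """Maximum rollout length (in tokens) to use at this training step."""
        return self.compress_len if self.phase(step) == "compress" else self.expand_len

    def cycle_index(self, step: int) -> int:
        """Which compression-expansion cycle (0-based) this step belongs to."""
        return step // (self.compress_steps + self.expand_steps)


def truncate_rollout(tokens: List[int], max_len: int) -> Tuple[List[int], bool]:
    """Cut a sampled rollout at the current limit and flag whether it was cut.

    How a truncated rollout is rewarded is not specified in the abstract,
    so that decision is left to the caller.
    """
    return tokens[:max_len], len(tokens) > max_len


# Example: the limit alternates 8192 -> 16384 -> 8192 ... every 100 steps.
schedule = InterleavedLengthSchedule(
    compress_len=8192, expand_len=16384, compress_steps=100, expand_steps=100
)
for step in (0, 99, 100, 199, 200):
    print(step, schedule.phase(step), schedule.max_len(step))
```

A square wave is simply the most direct reading of "alternates between compressing and expanding"; the actual schedule shape, phase lengths, and limits behind SIRI-low and SIRI-high should be taken from the paper itself.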
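Abstract (reference translation): We introduce SIRI (Scaling Iterative Reinforcement Learning with Interleaved Compression), a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Prior studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget by dynamically adjusting the maximum rollout length during training. The compression phase shortens the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase relaxes the length limit, giving the model room to explore and plan in long-horizon settings. Remarkably, after each compression-expansion cycle, the model's performance improves even as its output length decreases, steadily approaching the Pareto frontier of the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations. The experiments suggest the potential of periodically oscillating the LRM's output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging toward an optimal "sweet spot" between the two. Our models are publicly available.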
