Fugu-MT Paper Translation (Abstract): OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Paper overview: OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

arxiv url: http://arxiv.org/abs/2506.20512v1
Date: Wed, 25 Jun 2025 14:58:13 GMT
Status: Translation complete
Last updated in system: 2025-06-26 21:00:42.799549
Title: OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Title (reference translation): OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Authors: Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
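Abstract summary: Different language model families, such as Llama and Qwen, behave differently during post-training with reinforcement learning (RL). This study shows that high-quality mathematical corpora such as MegaMath-Web-Pro significantly improve both base model and RL performance. It introduces Stable-then-Decay, a two-stage mid-training strategy in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay.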
Reference score (independently computed attention score): 29.818409458662344
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and instability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
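Abstract (reference translation): Different base language model families, such as Llama and Qwen, behave differently during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing the next generation of RL-scalable foundation models. In this work, we examine how mid-training strategies shape RL dynamics, focusing on two representative model families, Qwen and Llama. We find that (1) high-quality mathematical corpora such as MegaMath-Web-Pro significantly improve both base model and RL performance, whereas existing alternatives (e.g., FineMath-4plus) do not; (2) adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long CoT improves reasoning depth, it can also induce verbose model responses and unstable RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these findings, we introduce Stable-then-Decay, a two-stage mid-training strategy in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models that shows strong RL compatibility and closes the performance gap with more RL-friendly model families, i.e., Qwen. We hope this work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated, math reasoning-intensive corpus of over 70 billion tokens (MegaMath-Web-Pro-Max).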
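
The Stable-then-Decay schedule described in the abstract can be pictured as a learning rate that stays flat for the first 200B tokens and then decays over the next 20B tokens. The following is a minimal sketch of that shape only. The two token budgets come from the abstract; the function name, the peak and final learning rates, and the choice of a cosine decay curve are illustrative assumptions, since the abstract does not state them.

```python
import math

# Token budgets stated in the abstract.
STABLE_TOKENS = 200e9  # stage 1: constant learning rate
DECAY_TOKENS = 20e9    # stage 2: learning rate decay (per CoT-focused branch)


def stable_then_decay_lr(tokens_seen, peak_lr=1e-4, final_lr=1e-5):
    """Return the learning rate after `tokens_seen` training tokens.

    peak_lr, final_lr, and the cosine shape are hypothetical values chosen
    for illustration; they are not taken from the paper.
    """
    if tokens_seen <= STABLE_TOKENS:
        # Stable stage: hold the learning rate constant.
        return peak_lr
    # Decay stage: fraction of the decay budget consumed, clamped to [0, 1].
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


if __name__ == "__main__":
    for tokens in (0, 100e9, 200e9, 210e9, 220e9):
        print(f"{tokens / 1e9:6.0f}B tokens -> lr = {stable_then_decay_lr(tokens):.2e}")
```

In this sketch each of the three branches would start its own decay stage from the same stable-stage checkpoint, which is one plausible reading of "20B tokens across three CoT-focused branches".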


Related papers