Fugu-MT Paper Translation (Abstract): Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes

Paper overview: Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes

arxiv url: http://arxiv.org/abs/2512.11463v1
Date: Thu, 11 Dec 2025 00:51:18 GMT
Status: Translation complete
Last updated in system: 2025-12-15 15:48:11.738205
Title: Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes
Title (reference translation): Motif-2-12.7B-Reasoning: Practitioner's Guide to RL Training Recipes
Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Minsu Ha, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
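Abstract summary: We introduce a 12.7B-parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. The approach uses memory-efficient infrastructure for 64K-token contexts, built on hybrid parallelism and kernel-level optimizations. We describe a robust reinforcement learning fine-tuning pipeline that stabilizes training through difficulty-aware data filtering and mixed-policy trajectory reuse.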
Reference score (independently computed attention score): 7.998815625852598
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.
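Abstract (reference translation): We introduce Motif-2-12.7B-Reasoning, a 12.7B-parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. The approach combines memory-efficient infrastructure for 64K-token contexts, using hybrid parallelism and kernel-level optimizations, with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. We also describe a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline. Empirical results show that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.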

Related papers