Fugu-MT Paper Translation (Abstract): Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Paper summary: Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

arxiv url: http://arxiv.org/abs/2603.18325v1
Date: Wed, 18 Mar 2026 22:17:22 GMT
Status: translation complete
Last updated in system: 2026-03-20 17:19:05.867319
Title: Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
Title (reference translation): Learning to Reason with Curriculum I: Benefits of Autocurriculum
Authors: Nived Rajaraman, Audrey Huang, Miro Dudik, Robert Schapire, Dylan J. Foster, Akshay Krishnamurthy
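Abstract summary: This paper studies autocurriculum, in which the model uses its own performance to decide which problems to focus training on, for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, it shows that focusing teacher supervision on prompts where the current model struggles requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy.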
Reference score (independently computed attention score): 44.56791874493398
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
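Abstract (reference translation): Chain-of-thought reasoning, in which a language model performs additional computation by generating thinking tokens before its final response, has brought major advances in model capabilities. However, training these reasoning models is very costly in both data and compute, because it involves collecting long traces of reasoning behavior from humans or synthetic generators and then further training the model through reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithm design? This paper describes autocurriculum, in which the model uses its own performance to decide which problems to focus training on, for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, it shows that focusing teacher supervision on prompts where the current model struggles requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements come purely from adaptive data selection, build on classical techniques from boosting and learning from counterexamples, and require no assumptions about the distribution or difficulty of prompts.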
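
The idea of focusing teacher supervision on prompts where the current model struggles can be illustrated with a small toy sketch. This is not the paper's algorithm and does not reproduce its guarantees. Every name here (`teacher`, `verifier`, `ToyModel`, the integer prompts, the memorizing learner) is a hypothetical stand-in, and the sketch assumes a cheap correctness check is available without calling the teacher.

```python
import random


def teacher(prompt: int) -> int:
    """Hypothetical teacher: returns the correct answer (stands in for a reasoning demonstration)."""
    return (prompt * prompt) % 7


def verifier(prompt: int, answer: int) -> bool:
    """Hypothetical cheap correctness check, assumed usable without a teacher call."""
    return answer == (prompt * prompt) % 7


class ToyModel:
    """Toy learner: memorizes demonstrations and answers 0 on unseen prompts."""

    def __init__(self):
        self.memory = {}

    def answer(self, prompt: int) -> int:
        return self.memory.get(prompt, 0)

    def fit(self, prompt: int, demo: int) -> None:
        self.memory[prompt] = demo


def accuracy(model: ToyModel, prompts: list) -> float:
    return sum(verifier(p, model.answer(p)) for p in prompts) / len(prompts)


def non_adaptive(prompts: list, budget: int, rng: random.Random) -> ToyModel:
    """Spend the teacher budget on prompts drawn uniformly at random."""
    model = ToyModel()
    for _ in range(budget):
        p = rng.choice(prompts)
        model.fit(p, teacher(p))
    return model


def autocurriculum(prompts: list, budget: int, rng: random.Random) -> ToyModel:
    """Spend the teacher budget only on prompts the current model gets wrong."""
    model = ToyModel()
    for _ in range(budget):
        failing = [p for p in prompts if not verifier(p, model.answer(p))]
        if not failing:
            break  # nothing left to learn from the teacher
        p = rng.choice(failing)
        model.fit(p, teacher(p))
    return model


prompts = list(range(50))
budget = 42  # number of teacher demonstrations allowed
adaptive_acc = accuracy(autocurriculum(prompts, budget, random.Random(0)), prompts)
uniform_acc = accuracy(non_adaptive(prompts, budget, random.Random(0)), prompts)
print(f"adaptive: {adaptive_acc:.2f}, uniform: {uniform_acc:.2f}")
```

In this toy setting the adaptive loop never spends a demonstration on a prompt the model already answers correctly, while uniform sampling can repeat prompts or pick ones that need no help. The exponential separation claimed in the abstract depends on the paper's own setting and analysis, which this sketch does not model.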

Related papers