Fugu-MT Paper Translation (Abstract): Prompt Curriculum Learning for Efficient LLM Post-Training

Paper overview: Prompt Curriculum Learning for Efficient LLM Post-Training

arxiv url: http://arxiv.org/abs/2510.01135v1
Date: Wed, 01 Oct 2025 17:24:28 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.687386
Title: Prompt Curriculum Learning for Efficient LLM Post-Training
Title (reference translation): Prompt Curriculum Learning for Efficient LLM Post-Training
Authors: Zhaolin Gao, Joongwon Kim, Wen Sun, Thorsten Joachims, Sid Wang, Richard Yuanzhe Pang, Liang Tan,
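Abstract summary: We introduce Prompt Curriculum Learning (PCL), an algorithm that selects intermediate-difficulty prompts using a learned value model. We show that PCL can focus on progressively more challenging prompts during RL.
Reference score (independently computed attention measure): 30.19003037220951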
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We introduce Prompt Curriculum Learning (PCL), a lightweight reinforcement learning (RL) algorithm that selects intermediate-difficulty prompts using a learned value model to post-train language models. Since post-training LLMs via RL remains sensitive to batching and prompt selection strategies, we first conduct a series of systematic experiments where we (1) determine the optimal training batch size that balances generation efficiency and gradient quality and (2) establish the importance of focusing on prompts of intermediate difficulty for the policy. We build upon these results to design PCL, which identifies prompts of intermediate difficulty for the current policy in an on-policy manner by using a value model that is concurrently updated based on the current policy. By focusing on informative prompts that yield high effective ratios, PCL achieves either the highest performance or requires significantly less time to reach comparable performance to its counterparts. Compared to rollout-based filtering methods, PCL avoids costly rollouts and achieves $12.1\times$ and $16.9\times$ faster speed on identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively. We further demonstrate that our value model accurately predicts prompt difficulty and allows PCL to focus on progressively more challenging prompts during RL. Our results present a new methodology that delivers improved tradeoff between upper-bound performance and efficiency for reasoning-focused RL.
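Abstract (reference translation): We introduce Prompt Curriculum Learning (PCL), a lightweight RL algorithm that uses a learned value model to select intermediate-difficulty prompts for post-training language models. Because post-training LLMs via RL remains sensitive to batching and prompt selection strategies, we first run a series of systematic experiments that (1) determine the optimal training batch size balancing generation efficiency and gradient quality and (2) establish the importance of focusing on prompts of intermediate difficulty for the policy. Building on these results, we design PCL, which identifies intermediate-difficulty prompts for the current policy in an on-policy manner, using a value model that is updated concurrently with the current policy. By focusing on informative prompts that yield high effective ratios, PCL either achieves the highest performance or requires significantly less time to reach comparable performance. Compared with rollout-based filtering methods, PCL avoids costly rollouts and identifies intermediate-difficulty prompts $12.1\times$ and $16.9\times$ faster when training on MATH and DeepScaleR, respectively. We further demonstrate that our value model accurately predicts prompt difficulty and allows PCL to focus on progressively more challenging prompts during RL. The proposed method improves the tradeoff between upper-bound performance and efficiency in reasoning-focused RL.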
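The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how "intermediate difficulty" is scored, so the sketch assumes the value model predicts a success rate in [0, 1] and that prompts closest to a target of 0.5 are the intermediate ones. The names `TabularValueModel`, `select_intermediate`, `pcl_step`, and `rollout_success_rate` are hypothetical, a per-prompt running average stands in for a learned neural value model, and the policy-gradient update itself is omitted.

```python
import random


class TabularValueModel:
    """Toy stand-in for a learned value model (assumption: one running
    success-rate estimate per prompt instead of a neural network)."""

    def __init__(self, init=0.5, lr=0.5):
        self.init = init        # prior estimate for unseen prompts
        self.lr = lr            # step size for the running estimate
        self.estimates = {}

    def predict(self, prompt):
        # Predicted success rate of the current policy on this prompt.
        return self.estimates.get(prompt, self.init)

    def update(self, prompt, observed_success_rate):
        # Move the estimate toward the success rate observed in rollouts.
        old = self.predict(prompt)
        self.estimates[prompt] = old + self.lr * (observed_success_rate - old)


def select_intermediate(prompts, predict_value, k, target=0.5):
    """Return the k prompts whose predicted success rate is closest to
    `target` (assumed definition of intermediate difficulty)."""
    return sorted(prompts, key=lambda p: abs(predict_value(p) - target))[:k]


def pcl_step(pool, value_model, rollout_success_rate, n_candidates, k, rng):
    """One selection step: sample candidates, keep the k of intermediate
    predicted difficulty, roll out only on those, and update the value model.
    The policy update that would consume these rollouts is left out."""
    candidates = rng.sample(pool, n_candidates)
    selected = select_intermediate(candidates, value_model.predict, k)
    for prompt in selected:
        value_model.update(prompt, rollout_success_rate(prompt))
    return selected
```

The point of the design is that difficulty is estimated with a cheap value-model prediction for every candidate, while rollouts are generated only for the selected prompts, which is where the abstract locates the savings over rollout-based filtering.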

Related papers