Fugu-MT Paper Translation (Summary): Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

Paper summary: Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

arxiv url: http://arxiv.org/abs/2505.17652v1
Date: Fri, 23 May 2025 09:15:26 GMT
Status: Translation complete
Last updated in system: 2025-05-26 18:08:33.9508
Title: Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
Title (reference translation): Rethinking Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
Authors: Deyang Kong, Qi Guo, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
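Abstract summary: Reinforcement learning shows potential for enhancing the reasoning abilities of large language models. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulty. This paper introduces Competence-Difficulty Alignment Sampling (CDAS).
Reference score (attention score computed independently by the site): 27.94738910330893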
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models, yet it is hard to scale because of the low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulties. However, these approaches suffer from unstable and biased estimations of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To tackle these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulties by aggregating historical performance discrepancies of problems. Then the model competence is quantified to adaptively select problems whose difficulty is in alignment with the model's current competence using a fixed-point system. Experimental results across a range of challenging mathematical benchmarks show that CDAS achieves great improvements in both accuracy and efficiency. CDAS attains the highest average accuracy against baselines and exhibits significant speed advantages compared to Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower than CDAS.
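Abstract (reference translation): Reinforcement learning shows potential for improving the reasoning abilities of large language models, but it is difficult to scale because of low sample efficiency in the rollout phase. Existing methods attempt to improve efficiency by scheduling problems according to their difficulty. However, these methods suffer from unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, which leads to suboptimal results. To address these limitations, this paper proposes Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulty by aggregating the historical performance discrepancies of problems. The model's competence is then quantified, and a fixed-point system is used to adaptively select problems whose difficulty is aligned with the model's current competence. Experimental results show that CDAS achieves large improvements in both accuracy and efficiency. CDAS attains the highest average accuracy against the baselines and shows a significant speed advantage over Dynamic Sampling, a competitive strategy in DAPO.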
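The abstract describes the sampling idea only in words, so the following is a rough, illustrative sketch rather than the authors' implementation. Everything specific in it is an assumption made for illustration: the class and method names (`AlignmentSampler`, `update`, `select`), the use of a sigmoid of (competence minus difficulty) as the expected pass rate, the definition of competence as the negative mean difficulty, and the choice to pick the problems with the smallest gap between difficulty and competence. The abstract confirms only that difficulty aggregates historical performance discrepancies and that problems are selected to match the model's current competence.

```python
import math
from dataclasses import dataclass


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class ProblemStats:
    """Running history for one problem (hypothetical helper)."""
    discrepancy_sum: float = 0.0
    n_updates: int = 0

    @property
    def difficulty(self) -> float:
        # Assumption: difficulty is the mean of past discrepancies
        # (expected pass rate minus observed pass rate).
        if self.n_updates == 0:
            return 0.0
        return self.discrepancy_sum / self.n_updates


class AlignmentSampler:
    """Illustrative competence-difficulty alignment sampler (not the paper's code)."""

    def __init__(self, problem_ids):
        self.stats = {pid: ProblemStats() for pid in problem_ids}

    def competence(self) -> float:
        # Assumption: competence is the negative mean difficulty over all problems.
        total = sum(s.difficulty for s in self.stats.values())
        return -total / len(self.stats)

    def select(self, batch_size: int):
        # Pick the problems whose difficulty is closest to the current competence.
        c = self.competence()
        ranked = sorted(self.stats, key=lambda pid: abs(self.stats[pid].difficulty - c))
        return ranked[:batch_size]

    def update(self, pass_rates):
        # pass_rates maps problem id -> observed pass rate in [0, 1] from rollouts.
        c = self.competence()  # computed once, before any difficulty changes
        for pid, observed in pass_rates.items():
            s = self.stats[pid]
            expected = sigmoid(c - s.difficulty)  # assumed expected pass rate
            s.discrepancy_sum += expected - observed
            s.n_updates += 1
```

In a training loop under these assumptions, one would call `select` to choose the next batch of problems, run rollouts to obtain pass rates, and then call `update` with those pass rates. Averaging discrepancies over the whole history, instead of using only the latest pass rate, is what keeps the difficulty estimate from swinging with a single noisy step.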

Related papers