Fugu-MT Paper Translation (Abstract): HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning

Paper overview: HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning

arxiv url: http://arxiv.org/abs/2601.22448v1
Date: Fri, 30 Jan 2026 01:31:17 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.151595
Title: HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning
Authors: Weiqi Wang, Xin Liu, Binxuan Huang, Hejie Cui, Rongzhi Zhang, Changlong Yu, Shuowei Jin, Jingfeng Yang, Qingyu Yin, Zhengyang Wang, Zheng Li, Yifan Gao, Priyanka Nigam, Bing Yin, Lihong Li, Yangqiu Song
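Abstract summary: HeaPA consistently improves accuracy and reaches target performance with fewer computations. The analyses suggest these gains come from frontier-focused sampling and on-policy pool growth.
Reference score (independently computed attention score): 78.12979615107564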
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: RLVR is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model's learning progress, so uniform sampling can't keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically assume a fixed pool, which makes it hard to support stable on-policy pool growth, or they add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On-Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap-based boundary sampling, expands the pool via on-policy augmentation with lightweight asynchronous validation, and stabilizes correlated queries through topology-aware re-estimation of pool statistics and controlled reinsertion. Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with fewer computations while keeping wall-clock time comparable. Our analyses suggest these gains come from frontier-focused sampling and on-policy pool growth, with the benefits becoming larger as model scale increases. Our code is available at https://github.com/horizon-rl/HeaPA.
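
The abstract does not spell out the heap mechanics, so the following is only a minimal sketch of one possible reading of "heap-based boundary sampling" over a bounded pool. The class name `FrontierPool`, the 0.5 target pass rate, and the eviction rule are all assumptions made for illustration; this is not the authors' implementation, which is in the linked repository.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass, field


@dataclass
class FrontierPool:
    """Hypothetical bounded prompt pool ordered by closeness to a target pass rate."""

    capacity: int
    target: float = 0.5  # assumed "frontier": prompts solved about half the time
    _heap: list = field(default_factory=list)
    _counter: int = 0  # tie-breaker so equal keys pop in insertion order

    def push(self, prompt: str, pass_rate: float) -> None:
        # Min-heap key: distance of the estimated pass rate from the target.
        key = abs(pass_rate - self.target)
        heapq.heappush(self._heap, (key, self._counter, prompt))
        self._counter += 1
        if len(self._heap) > self.capacity:
            # Keep the pool bounded by evicting the entry farthest from the frontier.
            self._heap.remove(max(self._heap))
            heapq.heapify(self._heap)

    def sample(self, k: int) -> list[str]:
        # Pop the k prompts nearest the frontier; the caller is expected to
        # push them back with refreshed pass rates after rollouts.
        n = min(k, len(self._heap))
        return [heapq.heappop(self._heap)[2] for _ in range(n)]

    def __len__(self) -> int:
        return len(self._heap)


pool = FrontierPool(capacity=3)
for name, rate in [("a", 0.0), ("b", 0.5), ("c", 0.9), ("d", 0.4)]:
    pool.push(name, rate)
batch = pool.sample(2)
```

In this sketch, prompt "a" (never solved) is evicted when the pool overflows, and the sampled batch contains the two prompts whose pass rates lie closest to 0.5. Under the same assumptions, on-policy augmentation would amount to pushing newly generated and validated prompts into the pool.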

