Fugu-MT Paper Translation (Summary): PROWL: Prioritized Regret-Driven Optimization for World Model Learning

Paper summary: PROWL: Prioritized Regret-Driven Optimization for World Model Learning

arxiv url: http://arxiv.org/abs/2605.18803v1
Date: Mon, 11 May 2026 14:24:19 GMT
Status: Translation complete
System update date: 2026-05-20 21:37:32.342226
Title: PROWL: Prioritized Regret-Driven Optimization for World Model Learning
Title (reference translation): PROWL: Prioritized Regret-Driven Optimization for World Model Learning
Authors: Ahmet H. Güzel, Jenny Seidenschwarz, Benjamin Graham, Jonathan Sadeghi, Jeffrey Hawke, Jack Parker-Holder, Ilija Bogunovic
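Abstract summary: We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model. We implement the approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories.
Reference score (independently computed attention score): 20.10187986360715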
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model while remaining close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world-model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.
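Abstract (reference translation): Modern action-conditioned video world models achieve strong short-term visual realism, but they are unreliable on the rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than waiting for them to occur naturally. We introduce a KL-constrained adversarial curriculum that trains a policy to expose high-error trajectories of a diffusion-based world model while staying close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, producing an adversarial training loop that turns rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories by prediction error, action fidelity, and learning progress. We implement the approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories. PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and shows that effective adversarial world-model training depends critically on balancing exploratory failure discovery with explicit behavioral regularization. These results suggest that scalable world models benefit not only from larger datasets but also from selectively generating informative training data.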
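Illustrative sketch (not from the paper): the abstract says the PAT buffer re-ranks trajectories by prediction error, action fidelity, and learning progress, but it does not say how these signals are combined. The Python below is one possible reading, assuming a simple weighted sum. The names `Trajectory`, `priority`, and `rerank`, the weights, the use of `exp(-KL)` as a fidelity measure, and the choice to penalize trajectories whose error has already dropped are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Trajectory:
    name: str
    prediction_error: float  # current world-model error on this trajectory
    previous_error: float    # error measured the last time it was evaluated
    action_kl: float         # assumed divergence of its actions from the behavior policy


def priority(t: Trajectory, w_err: float = 1.0, w_fid: float = 1.0,
             w_prog: float = 1.0) -> float:
    """Hypothetical priority score; the weighted-sum form is an assumption."""
    # Fidelity is 1.0 when actions match the behavior data and decays with divergence.
    fidelity = math.exp(-t.action_kl)
    # Progress is the error reduction since the last evaluation (never negative).
    progress = max(t.previous_error - t.prediction_error, 0.0)
    # Assumption: trajectories the model has already improved on sink in the ranking.
    return w_err * t.prediction_error + w_fid * fidelity - w_prog * progress


def rerank(buffer: list[Trajectory]) -> list[Trajectory]:
    """Return the buffer sorted from highest to lowest priority."""
    return sorted(buffer, key=priority, reverse=True)


buffer = [
    Trajectory("solved", prediction_error=0.1, previous_error=0.8, action_kl=0.1),
    Trajectory("off_distribution", prediction_error=0.9, previous_error=0.9, action_kl=3.0),
    Trajectory("unresolved", prediction_error=0.9, previous_error=0.9, action_kl=0.1),
]
for t in rerank(buffer):
    print(t.name, round(priority(t), 3))
```

Under these assumed weights, a high-error trajectory that stays close to the behavior data ranks first, an equally high-error but off-distribution one ranks second, and an already-resolved one ranks last. A real implementation would need the scoring rule and normalization from the paper itself.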

Related papers list