Fugu-MT Paper Translation (Summary): Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization

Paper summary: Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization

arxiv url: http://arxiv.org/abs/2510.01457v2
Date: Fri, 03 Oct 2025 16:23:36 GMT
Status: Translation complete
Last updated in system: 2025-10-06 14:21:29.921783
Title: Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization
Title (reference translation): Fixing the Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization
Authors: Brett Barkley, David Fridovich-Keil
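Abstract summary: This paper focuses on Model-Based Policy Optimization (MBPO). It shows that addressing the failure modes introduced by synthetic data enables policy improvement that was previously unattainable.
Reference score (independently computed attention score): 3.8532441307199963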
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Synthetic data is a core component of data-efficient Dyna-style model-based reinforcement learning, yet it can also degrade performance. We study when it helps, where it fails, and why, and we show that addressing the resulting failure modes enables policy improvement that was previously unattainable. We focus on Model-Based Policy Optimization (MBPO), which performs actor and critic updates using synthetic action counterfactuals. Despite reports of strong and generalizable sample-efficiency gains in OpenAI Gym, recent work shows that MBPO often underperforms its model-free counterpart, Soft Actor-Critic (SAC), in the DeepMind Control Suite (DMC). Although both suites involve continuous control with proprioceptive robots, this shift leads to sharp performance losses across seven challenging DMC tasks, with MBPO failing in cases where claims of generalization from Gym would imply success. This reveals how environment-specific assumptions can become implicitly encoded into algorithm design when evaluation is limited. We identify two coupled issues behind these failures: scale mismatches between dynamics and reward models that induce critic underestimation and hinder policy improvement during model-policy coevolution, and a poor choice of target representation that inflates model variance and produces error-prone rollouts. Addressing these failure modes enables policy improvement where none was previously possible, allowing MBPO to outperform SAC in five of seven tasks while preserving the strong performance previously reported in OpenAI Gym. Rather than aiming only for incremental average gains, we hope our findings motivate the community to develop taxonomies that tie MDP task- and environment-level structure to algorithmic failure modes, pursue unified solutions where possible, and clarify how benchmark choices ultimately shape the conditions under which algorithms generalize.
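Abstract (reference translation): Synthetic data is a core component of data-efficient Dyna-style model-based reinforcement learning, yet it can also degrade performance. We show that addressing the resulting failure modes enables policy improvement that was previously unattainable. We focus on Model-Based Policy Optimization (MBPO). Despite reports of strong and generalizable sample-efficiency gains in OpenAI Gym, recent work shows that MBPO often underperforms its model-free counterpart, Soft Actor-Critic (SAC), in the DeepMind Control Suite (DMC). Although both suites involve continuous control with proprioceptive robots, this shift leads to sharp performance losses across seven challenging DMC tasks, with MBPO failing in cases where claims of generalization from Gym would imply success. This reveals how environment-specific assumptions can become implicitly encoded into algorithm design when evaluation is limited. Two coupled issues lie behind these failures: scale mismatches between the dynamics and reward models, which induce critic underestimation and hinder policy improvement during model-policy coevolution, and a poor choice of target representation, which inflates model variance and produces error-prone rollouts. Addressing these failure modes enables policy improvement where none was previously possible, allowing MBPO to outperform SAC in five of seven tasks while preserving the strong performance previously reported in OpenAI Gym. Rather than aiming only for incremental average gains, we hope our findings motivate the community to develop taxonomies that tie MDP task- and environment-level structure to algorithmic failure modes, pursue unified solutions where possible, and clarify how benchmark choices ultimately shape the conditions under which algorithms generalize.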

Related papers