Fugu-MT Paper Translation (Abstract): Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting

Paper abstract: Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting

arxiv url: http://arxiv.org/abs/2603.08118v1
Date: Mon, 09 Mar 2026 08:59:45 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.717769
Title: Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting
Title (reference translation): Model-Based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting
Authors: Zhongjian Qiao, Jiafei Lyu, Boxiang Lyu, Yao Shu, Siyang Gao, Shuang Qiu
Abstract summary: Adversarial model learning offers a theoretical framework for mitigating model exploitation. We propose RObust value-aware Model learning with Implicitly differentiable adaptive weighting (ROMI).
Reference score (independently computed attention score): 26.86263818777302
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Model-based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, model exploitation could occur due to inevitable model errors, degrading algorithm performance. Adversarial model learning offers a theoretical framework to mitigate model exploitation by solving a maximin formulation. Within such a paradigm, RAMBO (Rigter et al., 2022) has emerged as a representative and most popular method that provides a practical implementation with model gradient. However, we empirically reveal that severe Q-value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose RObust value-aware Model learning with Implicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with model gradient, ROMI introduces a novel robust value-aware model learning approach. This approach requires the dynamics model to predict future states with values close to the minimum Q-value within a scale-adjustable state uncertainty set, enabling controllable conservatism and stable model updates. To further improve out-of-distribution (OOD) generalization during multi-step rollouts, we propose implicitly differentiable adaptive weighting, a bi-level optimization scheme that adaptively achieves dynamics- and value-aware model learning. Empirical results on D4RL and NeoRL datasets show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared to other state-of-the-art methods on datasets where RAMBO typically underperforms. Code is available at https://github.com/zq2r/ROMI.git.
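Abstract (reference translation): Model-based offline reinforcement learning (RL) aims to strengthen offline RL with a dynamics model that facilitates policy exploration. However, because model errors are inevitable, model exploitation can occur and degrade algorithm performance. Adversarial model learning provides a theoretical framework that mitigates model exploitation by solving a maximin formulation. Within this paradigm, RAMBO (Rigter et al., 2022) has emerged as a representative and the most widely used method, offering a practical implementation based on the model gradient. However, we show empirically that severe Q-value underestimation and gradient explosion can occur in RAMBO after only slight hyperparameter tuning, which suggests that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose RObust value-aware Model learning with Implicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with the model gradient, ROMI introduces a new robust value-aware model learning approach. This approach requires the dynamics model to predict future states whose values are close to the minimum Q-value within a scale-adjustable state uncertainty set, which enables controllable conservatism and stable model updates. To further improve out-of-distribution (OOD) generalization during multi-step rollouts, we propose implicitly differentiable adaptive weighting, a bi-level optimization scheme that adaptively achieves dynamics-aware and value-aware model learning. Empirical results on the D4RL and NeoRL datasets show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared with other state-of-the-art methods on datasets where RAMBO typically underperforms. Code is available at https://github.com/zq2r/ROMI.git.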
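The robust value-aware objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the shape of the uncertainty set, how the minimum is found, or the exact loss. The sketch assumes a box-shaped set of half-width `radius` around the observed next state, approximates the minimum value by random sampling, and uses a squared error. All names (`robust_value_target`, `robust_value_aware_loss`, `toy_value`) are hypothetical.

```python
import random
from typing import Callable, Sequence

ValueFn = Callable[[Sequence[float]], float]


def robust_value_target(value_fn: ValueFn, center: Sequence[float],
                        radius: float, n_samples: int,
                        rng: random.Random) -> float:
    """Approximate the minimum value inside a box of half-width `radius`
    around `center` (assumed form of the state uncertainty set)."""
    best = value_fn(center)  # the center itself always belongs to the set
    for _ in range(n_samples):
        candidate = [x + rng.uniform(-radius, radius) for x in center]
        best = min(best, value_fn(candidate))
    return best


def robust_value_aware_loss(value_fn: ValueFn, predicted: Sequence[float],
                            observed: Sequence[float], radius: float,
                            n_samples: int, rng: random.Random) -> float:
    """Squared gap between the value of the model's predicted next state
    and the sampled minimum value around the observed next state."""
    target = robust_value_target(value_fn, observed, radius, n_samples, rng)
    return (value_fn(predicted) - target) ** 2


def toy_value(state: Sequence[float]) -> float:
    """Hypothetical value function for illustration: negative squared norm."""
    return -sum(x * x for x in state)


if __name__ == "__main__":
    rng = random.Random(0)
    observed = [1.0, -2.0]
    for radius in (0.0, 0.1, 0.5):
        target = robust_value_target(toy_value, observed, radius, 64, rng)
        print(f"radius={radius}: robust target={target:.3f}")
```

In this sketch `radius` plays the role of the adjustable scale: a radius of zero reduces the target to the value of the observed next state, and a larger radius can only lower the target, which corresponds to more conservatism.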

Related papers