Fugu-MT Paper Translation (Abstract): Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

Paper overview: Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

arxiv url: http://arxiv.org/abs/2605.11020v1
Date: Sun, 10 May 2026 15:32:24 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.307078
Title: Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
Authors: Anish Diwan, Davide Tateo, Christopher E. Mower, Haitham Bou-Ammar, Jan Peters, Oleg Arenz
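Abstract summary: Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. This work bridges the gap between classical dual-ascent IRL and adversarial methods by enabling monotonic improvement of the reward function and policy without fully solving an RL problem at every iteration. The proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean.
Reference score (independently computed attention measure): 25.957276792858085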
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL--one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.
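The classical dual-ascent loop that the abstract contrasts against can be illustrated on a toy problem. The sketch below is not TRIRL and is not taken from the paper: it assumes a one-step problem with three actions, where "fully solving the RL problem" for a reward reduces to a softmax, and the dual gradient is the gap between the expert's and the policy's action distributions. The function names, step size, iteration count, and the expert distribution are all hypothetical choices made for illustration.

```python
import math


def soft_optimal_policy(reward):
    """Maximum-entropy optimal policy for a one-step problem: softmax of the reward."""
    m = max(reward)  # subtract the max for numerical stability
    weights = [math.exp(r - m) for r in reward]
    total = sum(weights)
    return [w / total for w in weights]


def dual_ascent_irl(expert_dist, steps=500, step_size=1.0):
    """Toy dual ascent: ascend sum(expert * r) - logsumexp(r) over the reward r.

    Assumption: one common sign convention for the maximum-entropy IRL dual;
    the paper may use a different but equivalent form.
    """
    reward = [0.0] * len(expert_dist)
    for _ in range(steps):
        # Inner step: solve the entropy-regularized problem exactly for this reward.
        policy = soft_optimal_policy(reward)
        # Outer step: the dual gradient is the expert/policy distribution mismatch.
        reward = [r + step_size * (e - p)
                  for r, e, p in zip(reward, expert_dist, policy)]
    return reward, soft_optimal_policy(reward)


expert = [0.7, 0.2, 0.1]  # hypothetical expert action distribution
reward, policy = dual_ascent_irl(expert)
```

In this toy setting the inner solve is a closed-form softmax, so each dual step is cheap. In sequential tasks the inner step is a full RL problem, which is the per-iteration cost the abstract says TRIRL avoids by relying on a local search around the current policy.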