Fugu-MT Paper Translation (Abstract): Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning

Paper overview: Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning

arxiv url: http://arxiv.org/abs/2606.22056v1
Date: Sat, 20 Jun 2026 14:14:24 GMT
Status: Translation complete
Last updated in system: 2026-06-25 23:01:55.966424
Title: Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning
Authors: Tian Xu, Zexuan Chen, Zhilong Zhang, Yi-Chen Li, Chenyang Wang, Lei Yuan, Yang Yu
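Abstract summary: Adversarial imitation learning (AIL) achieves higher-quality imitation than behavioral cloning (BC) but demands substantial online environment interaction. Recent empirical work has explored initializing AIL algorithms with BC-pretrained policies to address this limitation. This paper provides a systematic theoretical analysis and introduces principled pretraining algorithms for accelerating AIL.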
Reference score (independently computed attention score): 21.454127729966462
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adversarial imitation learning (AIL) achieves high-quality imitation compared to behavioral cloning (BC), but demands substantial online environment interaction. Recent empirical work has explored initializing AIL algorithms with BC pretrained policies to address this limitation, yet a rigorous theoretical understanding of pretraining's role in AIL remains elusive. This paper provides a systematic theoretical analysis and introduces principled pretraining algorithms for accelerating AIL. We begin by analyzing AIL with policy pretraining alone, identifying reward error as the dominant source of suboptimality. This reveals a critical and previously overlooked gap: the absence of reward pretraining. Motivated by this finding, we develop a principled policy-reward co-pretraining approach grounded in a reward shaping analysis. Our analysis uncovers a fundamental connection between expert policies and shaping rewards, which naturally gives rise to CoPT-AIL, an approach that jointly pretrains both policy and reward through a single BC procedure. We prove that CoPT-AIL achieves an improved imitation gap bound over standard AIL, establishing the first theoretical guarantee for the benefits of pretraining in AIL. Experimental results confirm CoPT-AIL's superior performance over existing AIL methods.

Related papers