Fugu-MT Paper Translation (Abstract): Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

Paper overview: Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

arxiv url: http://arxiv.org/abs/2606.02194v1
Date: Mon, 01 Jun 2026 12:49:14 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:32.073345
Title: Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
Title (reference translation): Coherent off-policy improvement of large behavior models with learned rewards
Authors: Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters
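Abstract summary: Reinforcement learning can be used to finetune policies using additional experience. In inverse reinforcement learning, a dense reward function is learned from expert demonstrations. The proposed method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks.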
Reference score (independently computed attention score): 24.576576709809036
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning, where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy through using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.
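Abstract (reference translation): Distilling expert demonstration data into large generative models with behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly dexterous manipulation. Reinforcement learning (RL) can serve as a means of finetuning these policies with additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a small residual policy that corrects the pretrained model. However, on typical sparse-reward tasks, RL algorithms can struggle to optimize behavior sample-efficiently. This paper explores inverse reinforcement learning (IRL), in which a dense reward function is learned from expert demonstrations, to ease the challenge of RL finetuning. It specifically considers coherent imitation learning, an IRL method that facilitates improvement of the BC policy by using a specific reward formulation with theoretical guarantees. The proposed method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five of six complex manipulation tasks, outperforming RL-based baselines that use sparse rewards. Because the initial pretrained finetuning policy is ensured to be optimal for the initial reward and critic, the method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.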


Related papers