Fugu-MT paper translation (abstract): Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

Paper abstract: Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

arxiv url: http://arxiv.org/abs/2606.03949v1
Date: Tue, 02 Jun 2026 17:38:25 GMT
Status: translation complete
Last updated in system: 2026-06-03 22:00:05.223781
Title: Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation
Title (reference translation): Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation
Authors: Zeyi Liu, Guangyao Liu, Yinuo Qu, Yuquan Xue, Bofang Jia, Chunhua Yang, Weihua Gui, Keke Huang, Ziwei Wang
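Abstract summary: We propose PACT, a Preference-calibrated Actor-Critic Training framework. First, we design a progress model that learns from human demonstrations and identifies suboptimal segments for credit correction. Next, we build preference pairs to define a counterfactual advantage that penalizes the Bellman targets of the identified suboptimal segments.
Reference score (independently computed attention level): 40.17737666526493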
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human-in-the-loop reinforcement learning (HIL-RL) improves sample efficiency in real-robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task-execution path and force human intervention. Existing HIL-RL methods typically apply the consistent credit assignment principle to all transitions, uniformly propagating discounted terminal rewards through suboptimal segments, ignoring the actual contribution of each transition to task success. This overestimates Q-values for critic learning and indirectly misguides actor updates toward suboptimal behavior patterns. To this end, we propose PACT, a Preference-calibrated Actor-Critic Training framework that leverages the implicit preference signals induced by intervention to perform credit reassignment on identified suboptimal segments while directly guiding policy training for unbiased critic-actor learning. Specifically, we first design a progress model that learns from human demonstration and identifies suboptimal segments for credit correction. Then, from the human action and resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes Bellman targets of the identified suboptimal segment, enabling directional credit calibration. Moreover, we directly align the policy with human corrective actions in the bounded mean space, providing an additional signal beyond critic-guided updates. Across five real-robot manipulation tasks, PACT improves the average success rate by 24.5% and achieves 1.3 times faster convergence, thereby improving both RL sample efficiency and performance. Code is available at https://anonymous.4open.science/r/HILRL-A1X-BC05.
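Abstract (reference translation): Human-in-the-loop reinforcement learning (HIL-RL) improves sample efficiency in real-robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task-execution path and force human intervention. Existing HIL-RL methods typically apply one uniform credit assignment principle to all transitions, propagating discounted terminal rewards evenly through suboptimal segments and ignoring how much each transition actually contributes to task success. This leads the critic to overestimate Q-values and indirectly steers actor updates toward suboptimal behavior patterns. We therefore propose PACT, a Preference-calibrated Actor-Critic Training framework that uses the implicit preference signals induced by intervention to reassign credit on the identified suboptimal segments while directly guiding policy training, aiming at unbiased critic-actor learning. Specifically, we first design a progress model that learns from human demonstrations and identifies suboptimal segments for credit correction. Then, from the human action and the resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes the Bellman targets of the identified suboptimal segments, enabling directional credit calibration. In addition, we directly align the policy with human corrective actions in the bounded mean space, providing a signal beyond critic-guided updates. Across five real-robot manipulation tasks, PACT improves the average success rate by 24.5% and converges 1.3 times faster, improving both RL sample efficiency and performance. Code is available at https://anonymous.4open.science/r/HILRL-A1X-BC05.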
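
The abstract describes two mechanisms without giving formulas: a penalty on the Bellman targets of suboptimal segments, and a direct alignment of the policy mean with human corrective actions. The following Python sketch is one possible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the `Transition` fields, the hinge form of the counterfactual advantage, the `penalty_weight` parameter, and the use of `tanh` plus a squared error as the "bounded mean space" loss.

```python
import math
from dataclasses import dataclass


@dataclass
class Transition:
    """One replay-buffer entry (hypothetical layout, not from the paper)."""
    reward: float
    next_q: float           # critic estimate Q(s', a') at the next step
    done: bool              # True if the episode ended at this step
    suboptimal: bool        # assumed to be flagged by a progress model
    q_human: float = 0.0    # Q(s_int, a_human) at the intervention state
    q_policy: float = 0.0   # Q(s_int, a_policy), resampled policy action


def calibrated_target(t: Transition, gamma: float = 0.99,
                      penalty_weight: float = 1.0) -> float:
    """Standard TD target, lowered on transitions flagged as suboptimal.

    The penalty is an assumed hinge on the preference pair: it is positive
    only when the critic rates the human action above the policy action.
    """
    target = t.reward + (0.0 if t.done else gamma * t.next_q)
    if t.suboptimal:
        counterfactual_advantage = max(0.0, t.q_human - t.q_policy)
        target -= penalty_weight * counterfactual_advantage
    return target


def bounded_mean_alignment_loss(pre_tanh_mean: list[float],
                                human_action: list[float]) -> float:
    """Mean squared error between the tanh-squashed policy mean and the
    human corrective action, both assumed to lie in [-1, 1]."""
    squashed = [math.tanh(m) for m in pre_tanh_mean]
    squared_errors = [(s - a) ** 2 for s, a in zip(squashed, human_action)]
    return sum(squared_errors) / len(squared_errors)
```

In this sketch, transitions outside the flagged segments keep the ordinary target `reward + gamma * next_q`, so the penalty only changes credit where the progress model and the intervention both indicate a problem. The alignment loss would be added to the usual critic-guided actor loss with some weighting, which the abstract does not specify.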

