Fugu-MT Paper Translation (Abstract): OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

Paper summary: OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

arxiv url: http://arxiv.org/abs/2605.15971v1
Date: Fri, 15 May 2026 14:02:34 GMT
Status: Translation complete
Last updated in system: 2026-05-18 17:44:16.334593
Title: OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation
Authors: Yunyang Mo, Jian Li, Qiwei Wu, Yihang Kang, Renjing Xu
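Abstract summary: Online Human Preference as Guidance in Reinforcement Learning (OHP-RL) is a framework that leverages human interventions as preference information to guide policy learning. OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches.
Reference score (attention metric computed by this site): 16.28822074948203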
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration. Human-in-the-loop interventions offer a practical solution, yet existing methods typically exploit these interventions as auxiliary training signals, without fully capturing the richer information they provide about when and how autonomy should be guided. Human interventions often encode relative preferences over behavior under safety and task constraints, rather than prescribing exact actions to imitate. Motivated by this perspective, we propose Online Human Preference as Guidance in Reinforcement Learning (OHP-RL), a framework that leverages human interventions as preference information to guide policy learning. OHP-RL introduces a state-dependent preference gate that adaptively regulates when and to what extent human interventions should shape policy learning. This design enables the agent to benefit from intermittent and imperfect human feedback while preserving autonomous exploration and stable policy optimization. We evaluate OHP-RL on three challenging real-world contact-rich manipulation tasks on a Franka robot. Across all tasks, OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches. Moreover, the learned policies exhibit more stable and human-aligned behavior throughout training.


Related papers