Fugu-MT Paper Translation (Abstract): Learning Real-World Acrobatic Flight from Human Preferences

Paper overview: Learning Real-World Acrobatic Flight from Human Preferences

arxiv url: http://arxiv.org/abs/2508.18817v1
Date: Tue, 26 Aug 2025 08:56:53 GMT
Status: Translation complete
Last updated in system: 2025-08-27 17:42:38.763551
Title: Learning Real-World Acrobatic Flight from Human Preferences
Authors: Colin Merk, Ismail Geles, Jiaxu Xing, Angel Romero, Giorgia Ramponi, Davide Scaramuzza
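Abstract summary: Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions. This work explores the use of PbRL for agile drone control, focusing on the execution of dynamic maneuvers such as powerloops. The authors train policies in simulation and successfully transfer them to real-world drones, demonstrating multiple acrobatic maneuvers where human preferences emphasize stylistic qualities of motion.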
Reference score (independently computed attention score): 25.52648336834609
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. In this work, we explore the use of PbRL for agile drone control, focusing on the execution of dynamic maneuvers such as powerloops. Building on Preference-based Proximal Policy Optimization (Preference PPO), we propose Reward Ensemble under Confidence (REC), an extension to the reward learning objective that improves preference modeling and learning stability. Our method achieves 88.4% of the shaped reward performance, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them to real-world drones, demonstrating multiple acrobatic maneuvers where human preferences emphasize stylistic qualities of motion. Furthermore, we demonstrate the applicability of our probabilistic reward model in a representative MuJoCo environment for continuous control. Finally, we highlight the limitations of manually designed rewards, observing only 60.7% agreement with human preferences. These results underscore the effectiveness of PbRL in capturing complex, human-centered objectives across both physical and simulated domains.
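The abstract does not spell out the reward learning objective, so the following is only a minimal sketch of the Bradley-Terry preference model commonly used in PbRL, not the paper's REC method. It assumes each trajectory segment has already been scored by a learned reward model, and all function names (`preference_probability`, `preference_loss`, `agreement_rate`) are hypothetical. The last helper shows one plausible way an agreement figure between a reward function and human labels could be computed; the abstract does not say how its 60.7% value was obtained.

```python
import math


def preference_probability(return_a: float, return_b: float) -> float:
    """Bradley-Terry probability that segment A is preferred over segment B.

    return_a / return_b are the summed predicted rewards of each segment
    (assumed to come from some learned reward model).
    """
    diff = return_a - return_b
    # Numerically stable logistic function of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(return_a: float, return_b: float, label: float) -> float:
    """Cross-entropy loss for one labeled pair.

    label is 1.0 if the human preferred A and 0.0 if the human preferred B.
    """
    p = preference_probability(return_a, return_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def agreement_rate(pairs) -> float:
    """Fraction of pairs where the higher-return segment matches the human label.

    pairs is an iterable of (return_a, return_b, label); ties in return are
    treated as the reward function preferring B.
    """
    pairs = list(pairs)
    hits = sum(1 for ra, rb, y in pairs if (ra > rb) == (y == 1))
    return hits / len(pairs)
```

Minimizing `preference_loss` over many labeled pairs pushes the reward model to rank segments the way the human annotator does, which is the general idea behind learning a reward from preferences instead of designing it by hand.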

Related papers