Fugu-MT Paper Translation (Abstract): Agent-Driven Autonomous Reinforcement Learning Research: Iterative Policy Improvement for Quadruped Locomotion

Paper overview: Agent-Driven Autonomous Reinforcement Learning Research: Iterative Policy Improvement for Quadruped Locomotion

arxiv url: http://arxiv.org/abs/2603.27416v1
Date: Sat, 28 Mar 2026 21:30:04 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.9479
Title: Agent-Driven Autonomous Reinforcement Learning Research: Iterative Policy Improvement for Quadruped Locomotion
Title (reference translation): Agent-Driven Autonomous Reinforcement Learning Research: Iterative Policy Improvement for Quadruped Locomotion
Authors: Nimesh Khandelwal, Shakti S. Gupta
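Abstract summary: This paper documents a case study in agent-driven autonomous reinforcement learning research for quadruped locomotion. A human provided high-level directives through an agentic coding environment, while an agent carried out most of the execution loop. Across more than 70 experiments organized into fourteen waves on a DHAV1 12-DoF quadruped in Isaac Lab, the agent progressed from early rough-terrain runs with mean reward around 7 to a best logged Wave 12 run, exp063, with velocity error 0.263 and 97% timeout over 2000 iterations.
Reference score (independently computed attention score): 14.484745002483258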
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper documents a case study in agent-driven autonomous reinforcement learning research for quadruped locomotion. The setting was not a fully self-starting research system. A human provided high-level directives through an agentic coding environment, while an agent carried out most of the execution loop: reading code, diagnosing failures, editing reward and terrain configurations, launching and monitoring jobs, analyzing intermediate metrics, and proposing the next wave of experiments. Across more than 70 experiments organized into fourteen waves on a DHAV1 12-DoF quadruped in Isaac Lab, the agent progressed from early rough-terrain runs with mean reward around 7 to a best logged Wave 12 run, exp063, with velocity error 0.263 and 97% timeout over 2000 iterations, independently reproduced five times across different GPUs. The archive also records several concrete autonomous research decisions: isolating PhysX deadlocks to terrain sets containing boxes and stair-like primitives, porting four reward terms from openly available reference implementations \cite{deeprobotics, rlsar}, correcting Isaac Sim import and bootstrapping issues, reducing environment count for diagnosis, terminating hung runs, and pivoting effort away from HIM after repeated terrain=0.0 outcomes. Relative to the AutoResearch paradigm \cite{autoresearch}, this case study operates in a more failure-prone robotics RL setting with multi-GPU experiment management and simulator-specific engineering constraints. The contribution is empirical and documentary: it shows that an agent can materially execute the iterative RL research loop in this domain with limited human intervention, while also making clear where human direction still shaped the agenda.
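Abstract (reference translation): This paper documents a case study in agent-driven autonomous reinforcement learning research for quadruped locomotion. The setting was not a fully self-starting research system. The agent read code, diagnosed failures, edited reward and terrain configurations, launched and monitored jobs, analyzed intermediate metrics, and proposed the next wave of experiments. Across more than 70 experiments organized into fourteen waves on a DHAV1 12-DoF quadruped in Isaac Lab, the agent progressed from early rough-terrain runs with mean reward around 7 to the best logged Wave 12 run, exp063, with velocity error 0.263 and 97% timeout over 2000 iterations, independently reproduced five times across different GPUs. It isolated PhysX deadlocks to terrain sets containing boxes and stair-like primitives and ported four reward terms from openly available reference implementations. Relative to the AutoResearch paradigm \cite{autoresearch}, this case study operates in a more failure-prone robotics RL setting with multi-GPU experiment management and simulator-specific engineering constraints. The contribution is empirical and documentary: it shows that an agent can materially execute the iterative RL research loop in this domain with limited human intervention, while also making clear where human direction still shaped the agenda.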

Related papers