Fugu-MT Paper Translation (Abstract): How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

Paper summary: How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

arxiv url: http://arxiv.org/abs/2605.17077v1
Date: Sat, 16 May 2026 16:52:08 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.598011
Title: How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
Title (reference translation): Teaching Your Robot: Language Annotations for Powerful Robot Policy Learning
Authors: Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, Prithviraj Ammanabrolu
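Abstract summary: We introduce DeMiAn, a two-stage approach that labels demonstration segments with VLM-generated annotations. A learned instructor maps a task description and initial scene snapshot to a task-appropriate annotation at deployment. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle.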
Reference score (independently computed attention score): 69.68882580009982
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.
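Abstract (reference translation): Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, whereas language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. DeMiAn (Dense Multi-aspect Annotation) is a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so that generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, which shows that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.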

Related papers