Fugu-MT Paper Translation (Abstract): A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

Paper overview: A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

arxiv url: http://arxiv.org/abs/2602.01067v1
Date: Sun, 01 Feb 2026 07:22:55 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.058646
Title: A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation
Authors: Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, Andrew Beaulieu, Jose Barreiros,
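Abstract summary: Large behavior models have shown dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data. Recent work relies on co-training, which learns jointly from target robot data and heterogeneous data modalities. This paper presents a large-scale empirical study of five co-training data modalities: standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens.
Reference score (independently computed attention measure): 11.026552246133521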
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large behavior models have shown strong dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data, yet their generalization remains limited by insufficient robot data coverage. To expand this coverage without costly additional data collection, recent work relies on co-training: jointly learning from target robot data and heterogeneous data modalities. However, how different co-training data modalities and strategies affect policy performance remains poorly understood. We present a large-scale empirical study examining five co-training data modalities (standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens) across single- and multi-phase training strategies. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples to train vision-language-action policies. We evaluate 89 policies over 58,000 simulation rollouts and 2,835 real-world rollouts. Our results show that co-training with forms of vision-language and cross-embodiment robot data substantially improves generalization to distribution shifts, unseen tasks, and language following, while discrete action token variants yield no significant benefits. Combining effective modalities produces cumulative gains and enables rapid adaptation to unseen long-horizon dexterous tasks via fine-tuning. Training exclusively on robot data degrades the visiolinguistic understanding of the vision-language model backbone, while co-training with effective modalities restores these capabilities. Explicitly conditioning action generation on chain-of-thought traces learned from co-training data does not improve performance in our simulation benchmark. Together, these results provide practical guidance for building scalable generalist robot policies.
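The co-training idea described in the abstract, jointly learning from target robot data and other data modalities, can be illustrated with a minimal sketch of per-example source sampling. This is an illustration only and not the authors' pipeline: the modality names are taken from the abstract, but the mixture weights, the `MIXTURE_WEIGHTS` dictionary, and the `sample_batch_sources` helper are hypothetical, since the abstract does not report sampling ratios.

```python
import random
from collections import Counter

# Hypothetical mixture weights (assumption): the abstract does not state
# how often each modality is sampled during co-training.
MIXTURE_WEIGHTS = {
    "target_robot": 0.5,
    "vision_language": 0.2,
    "dense_language_annotations": 0.1,
    "cross_embodiment_robot": 0.1,
    "human_video": 0.05,
    "discrete_action_tokens": 0.05,
}


def sample_batch_sources(weights, batch_size, rng):
    """Choose the source modality for each example in one training batch."""
    names = list(weights)
    probs = [weights[name] for name in names]
    # random.choices samples with replacement, proportionally to the weights.
    return rng.choices(names, weights=probs, k=batch_size)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = sample_batch_sources(MIXTURE_WEIGHTS, batch_size=64, rng=rng)
counts = Counter(batch)  # number of examples drawn from each modality
```

In a real training loop, each sampled source name would select which dataset the next example is read from; a multi-phase strategy could be sketched by switching to a different weight dictionary between phases.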

Related papers