Fugu-MT Paper Translation (Abstract): VIMA: General Robot Manipulation with Multimodal Prompts

Paper summary: VIMA: General Robot Manipulation with Multimodal Prompts

arxiv url: http://arxiv.org/abs/2210.03094v1
Date: Thu, 6 Oct 2022 17:50:11 GMT
Status: Translation complete
Last updated in system: 2022-10-07 15:50:13.995813
Title: VIMA: General Robot Manipulation with Multimodal Prompts
Title (reference translation): VIMA: General Robot Manipulation with Multimodal Prompts
Authors: Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan
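Abstract summary: This work shows that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts that interleave textual and visual tokens. We design VIMA, a transformer-based generalist robot agent that processes these prompts and outputs motor actions autoregressively. We develop a new simulation benchmark with thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization.
Reference score (independently calculated attention score): 82.01214865117637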
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the top competing approach. We open-source all code, pretrained models, dataset, and simulation benchmark at https://vimalabs.github.io
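Abstract (reference translation): Prompt-based learning has succeeded in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by an input prompt. Task specification in robotics, however, takes various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often regarded as different tasks and tackled by specialized models. This work shows that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts that interleave textual and visual tokens. We design VIMA, a transformer-based generalist robot agent that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. In the hardest zero-shot generalization setting, it outperforms prior SOTA methods by up to $2.9\times$ in task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the top competing approach. We open-source all code, pretrained models, dataset, and simulation benchmark at https://vimalabs.github.io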
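The idea of a multimodal prompt that interleaves textual and visual tokens can be illustrated with a minimal sketch. This is not VIMA's actual tokenizer or data format: the names `TextToken`, `ImageToken`, and `build_prompt`, the brace-placeholder template syntax, and the image handles `img_001` and `img_002` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class TextToken:
    word: str


@dataclass(frozen=True)
class ImageToken:
    image_id: str  # hypothetical handle standing in for an image or object crop


PromptToken = Union[TextToken, ImageToken]


def build_prompt(template: str, images: Dict[str, str]) -> List[PromptToken]:
    """Split a template on whitespace; pieces like {name} become image tokens."""
    tokens: List[PromptToken] = []
    for piece in template.split():
        if piece.startswith("{") and piece.endswith("}"):
            # Look up the placeholder name and emit a visual token in its place.
            tokens.append(ImageToken(images[piece[1:-1]]))
        else:
            tokens.append(TextToken(piece))
    return tokens


# Illustrative prompt: text words and image placeholders share one sequence.
prompt = build_prompt(
    "Put the {object} into the {container}",
    {"object": "img_001", "container": "img_002"},
)
```

The resulting list keeps text and image tokens in a single ordered sequence, which is the property a sequence model needs in order to condition on both modalities at once. How a real agent embeds each token type is outside the scope of this sketch.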
