Vision-Language Foundation Models as Effective Robot Imitators
- URL: http://arxiv.org/abs/2311.01378v3
- Date: Mon, 5 Feb 2024 03:46:00 GMT
- Title: Vision-Language Foundation Models as Effective Robot Imitators
- Authors: Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu,
Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, Tao Kong
- Abstract summary: We derive a vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo.
By exceeding state-of-the-art performance by a large margin on the tested benchmark, we show that RoboFlamingo is an effective and competitive way to adapt VLMs to robot control.
- Score: 48.73027330407576
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent progress in vision-language foundation models has shown their ability
to understand multimodal data and resolve complicated vision-language tasks,
including robotic manipulation. We seek a straightforward way of making use of
existing vision-language models (VLMs) with simple fine-tuning on robotics
data. To this end, we derive a simple and novel vision-language manipulation
framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo.
Unlike prior works, RoboFlamingo uses the pre-trained VLM for single-step
vision-language comprehension, models sequential history information with an
explicit policy head, and is fine-tuned only lightly, by imitation learning on
language-conditioned manipulation datasets. This decomposition gives
RoboFlamingo the flexibility for open-loop control and deployment on
low-performance platforms. By exceeding state-of-the-art performance by a
large margin on the tested benchmark, we show that RoboFlamingo is an effective
and competitive way to adapt VLMs to robot control. Our extensive
experimental results also reveal several interesting conclusions regarding the
behavior of different pre-trained VLMs on manipulation tasks. We believe
RoboFlamingo has the potential to be a cost-effective and easy-to-use solution
for robotic manipulation, empowering everyone with the ability to fine-tune
their own robot manipulation policy.
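As a rough illustration of the decomposition described in the abstract (a pre-trained VLM used for single-step vision-language comprehension, plus a separate policy head that models the sequential history and is trained by imitation), here is a minimal PyTorch-style sketch. The module names, feature dimensions, LSTM head, and the 6-D arm plus gripper action split are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a RoboFlamingo-style decomposition (illustrative only):
# a pre-trained VLM encodes each (image, instruction) pair independently,
# and a separate recurrent policy head models the temporal history and
# predicts actions. Dimensions, the LSTM head, and the action split are
# assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class SequentialPolicyHead(nn.Module):
    """Aggregates per-step VLM features over time and predicts actions."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.arm_head = nn.Linear(hidden_dim, 6)      # e.g. relative end-effector pose
        self.gripper_head = nn.Linear(hidden_dim, 1)  # open/close logit

    def forward(self, feats, state=None):
        # feats: (batch, time, feat_dim) -- one VLM feature vector per timestep
        out, state = self.rnn(feats, state)
        return self.arm_head(out), self.gripper_head(out), state


class VLMManipulationPolicy(nn.Module):
    """A pre-trained VLM encoder (e.g. OpenFlamingo) plus an explicit policy head."""

    def __init__(self, vlm_encoder: nn.Module, feat_dim: int = 1024):
        super().__init__()
        self.vlm_encoder = vlm_encoder      # maps (image, instruction) -> (batch, feat_dim)
        self.policy_head = SequentialPolicyHead(feat_dim)

    def forward(self, images, instruction, state=None):
        # Single-step vision-language comprehension: each frame is encoded
        # independently; only the policy head sees the temporal history.
        feats = torch.stack(
            [self.vlm_encoder(frame, instruction) for frame in images.unbind(dim=1)],
            dim=1,
        )
        return self.policy_head(feats, state)


def imitation_loss(pred_arm, pred_grip_logit, target_arm, target_grip):
    # Behaviour cloning on language-conditioned demonstrations:
    # regress the continuous arm action, classify the binary gripper action.
    return nn.functional.mse_loss(pred_arm, target_arm) + \
        nn.functional.binary_cross_entropy_with_logits(pred_grip_logit, target_grip)
```

In this split, the heavy VLM is queried once per step while all temporal state lives in the lightweight head, which is the property the abstract credits for open-loop control and deployment on low-performance platforms.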
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate the model's ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
- KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data [45.25288643161976]
We propose Keypoint Affordance Learning from Imagined Environments (KALIE) for robotic control in a scalable manner.
Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations.
We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points.
arXiv Detail & Related papers (2024-09-21T08:45:16Z)
- Solving Robotics Problems in Zero-Shot with Vision-Language Models [0.0]
We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework designed to solve robotics problems in a zero-shot regime.
In our context, zero-shot means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description.
Our system showcases the ability to handle diverse tasks such as manipulation, goal-reaching, and visual reasoning -- all in a zero-shot manner.
arXiv Detail & Related papers (2024-07-26T21:18:57Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Vision Language Models (VLMs) can process state information as visual-textual prompts and respond with policy decisions in text.
We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [46.63773228934993]
We introduce an automatic synthetic data generation pipeline that instruction-tunes vision language models (VLMs) to robotic domains and needs.
Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions.
Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks.
arXiv Detail & Related papers (2024-06-15T19:22:51Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world (a minimal sketch of this point-based interface appears after this list).
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- What Matters in Language Conditioned Robotic Imitation Learning [26.92329260907805]
We study the most critical challenges in learning language conditioned policies from offline free-form imitation datasets.
We present a novel approach that significantly outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark.
arXiv Detail & Related papers (2022-04-13T08:45:32Z)
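Several of the related papers above (KALIE, RoboPoint, MOKA) share a common interface, referenced in the MOKA entry: the VLM does not emit motor commands directly but predicts point-based affordances in image space, which a downstream controller converts into motion. The sketch below illustrates that interface only; the class fields, function names, and pixel-to-world conversion are assumptions for illustration, not any of these papers' actual APIs.

```python
# Illustrative sketch of a point-based affordance interface in the spirit of
# KALIE / RoboPoint / MOKA. All names and the pixel-to-world conversion are
# assumptions for illustration, not any paper's actual API.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

import numpy as np

Point2D = Tuple[int, int]  # pixel coordinates (u, v)


@dataclass
class PointAffordance:
    """Task-relevant keypoints predicted by a VLM from an image and an instruction."""
    grasp_point: Point2D                     # where to grasp or make contact
    target_point: Point2D                    # where to move or place
    waypoints: List[Point2D] = field(default_factory=list)  # optional intermediate points


def affordance_to_command(
    affordance: PointAffordance,
    pixel_to_world: Callable[[Point2D], np.ndarray],
) -> dict:
    """Turn image-space keypoints into a simple pick-and-place command.

    `pixel_to_world` stands in for depth sensing and camera calibration,
    which the papers obtain in different ways; here it is an injected function.
    """
    return {
        "pick": pixel_to_world(affordance.grasp_point),
        "place": pixel_to_world(affordance.target_point),
        "via": [pixel_to_world(p) for p in affordance.waypoints],
    }
```

The appeal of this interface, as these abstracts emphasize, is that the VLM's open-world visual-language knowledge is used only to decide where to act, while how to move is delegated to conventional motion generation.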