EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
- URL: http://arxiv.org/abs/2305.15021v2
- Date: Wed, 13 Sep 2023 23:46:22 GMT
- Title: EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
- Authors: Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun
Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo
- Abstract summary: Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI.
Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
- Score: 95.37585041654535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied AI is a crucial frontier in robotics, capable of planning and
executing action sequences for robots to accomplish long-horizon tasks in
physical environments. In this work, we introduce EmbodiedGPT, an end-to-end
multi-modal foundation model for embodied AI, empowering embodied agents with
multi-modal understanding and execution capabilities. To achieve this, we have
made the following efforts: (i) We craft a large-scale embodied planning
dataset, termed EgoCOT. The dataset consists of carefully selected videos from
the Ego4D dataset, along with corresponding high-quality language instructions.
Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts"
mode for effective embodied planning. (ii) We introduce an efficient training
approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B
large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We
introduce a paradigm for extracting task-related features from LLM-generated
planning queries to form a closed loop between high-level planning and
low-level control. Extensive experiments show the effectiveness of EmbodiedGPT
on embodied tasks, including embodied planning, embodied control, visual
captioning, and visual question answering. Notably, EmbodiedGPT significantly
enhances the success rate of the embodied control task by extracting more
effective features. It has achieved a remarkable 1.6 times increase in success
rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World
benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.
Related papers
- SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
Segment Anything (SAM) is a vision-foundation model for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL)
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions [5.31484618181979]
Behavior Tree (BT) is an appropriate control architecture for robots executing tasks following human instructions.
This paper proposes a two-stage framework for BT generation, which first employs large language models to interpret goals from high-level instructions.
We represent goals as well-formed formulas in first-order logic, effectively bridging intent understanding and optimal behavior planning.
arXiv Detail & Related papers (2024-05-13T05:23:48Z) - Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following [17.608330952846075]
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments.
One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data.
We introduce the Socratic Planner, the first zero-shot planning method that infers without the need for any training data.
arXiv Detail & Related papers (2024-04-21T08:10:20Z) - Consolidating Trees of Robotic Plans Generated Using Large Language
Models to Improve Reliability [6.4111574364474215]
The inherent probabilistic nature of Large Language Models (LLMs) introduces an element of unpredictability.
This paper introduces an innovative approach aims to generate correct and optimal robotic task plans for diverse real-world demands and scenarios.
arXiv Detail & Related papers (2024-01-15T18:01:59Z) - EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z) - Planning as In-Painting: A Diffusion-Based Embodied Task Planning
Framework for Environments under Uncertainty [56.30846158280031]
Task planning for embodied AI has been one of the most challenging problems.
We propose a task-agnostic method named 'planning as in-painting'
The proposed framework achieves promising performances in various embodied AI tasks.
arXiv Detail & Related papers (2023-12-02T10:07:17Z) - AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot
Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset AlphaBlock consists of 35 comprehensive high-level tasks of multi-step text plans and paired observation.
arXiv Detail & Related papers (2023-05-30T09:54:20Z) - Multimodal Contextualized Plan Prediction for Embodied Task Completion [9.659463406886301]
Task planning is an important component of traditional robotics systems enabling robots to compose fine grained skills to perform more complex tasks.
Recent work building systems for translating natural language to executable actions for task completion in simulated embodied agents is focused on directly predicting low level action sequences.
We focus on predicting a higher level plan representation for one such embodied task completion dataset - TEACh.
arXiv Detail & Related papers (2023-05-10T22:29:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.