EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
- URL: http://arxiv.org/abs/2305.15021v2
- Date: Wed, 13 Sep 2023 23:46:22 GMT
- Title: EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
- Authors: Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun
Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo
- Abstract summary: Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI.
Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
- Score: 95.37585041654535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied AI is a crucial frontier in robotics, capable of planning and
executing action sequences for robots to accomplish long-horizon tasks in
physical environments. In this work, we introduce EmbodiedGPT, an end-to-end
multi-modal foundation model for embodied AI, empowering embodied agents with
multi-modal understanding and execution capabilities. To achieve this, we have
made the following efforts: (i) We craft a large-scale embodied planning
dataset, termed EgoCOT. The dataset consists of carefully selected videos from
the Ego4D dataset, along with corresponding high-quality language instructions.
Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts"
mode for effective embodied planning. (ii) We introduce an efficient training
approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B
large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We
introduce a paradigm for extracting task-related features from LLM-generated
planning queries to form a closed loop between high-level planning and
low-level control. Extensive experiments show the effectiveness of EmbodiedGPT
on embodied tasks, including embodied planning, embodied control, visual
captioning, and visual question answering. Notably, EmbodiedGPT significantly
enhances the success rate of the embodied control task by extracting more
effective features. It has achieved a remarkable 1.6 times increase in success
rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World
benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.
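As a rough, hedged illustration of step (ii), the sketch below shows how a 7B causal language model could be adapted to chain-of-thought planning text via prefix tuning with the Hugging Face peft library. This is not the authors' implementation: the model name, training example, and hyperparameters are placeholders.
```python
# Hedged sketch only: prefix-tune a frozen 7B causal LM on EgoCOT-style
# chain-of-thought planning text. Model name and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PrefixTuningConfig, TaskType, get_peft_model

model_name = "your-7b-causal-lm"  # placeholder; the paper adapts a 7B LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Prefix tuning trains only a small set of virtual prefix tokens,
# leaving the 7B backbone frozen.
peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=30)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()

# Hypothetical EgoCOT-style example: an instruction paired with a plan
# decomposed into sub-goals.
example = (
    "Task: make a cup of coffee.\n"
    "Plan: 1) locate the mug; 2) place it under the machine; "
    "3) insert a capsule; 4) press the brew button."
)
batch = tokenizer(example, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM loss on the plan text
loss.backward()
```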
Related papers
- Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos [48.15438373870542]
VidAssist is an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos.
It employs a breadth-first search algorithm for optimal plan generation.
Experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups.
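As a toy, hedged illustration of the breadth-first plan search idea (not VidAssist's actual code), the snippet below enumerates candidate step orderings and keeps the best-scoring complete plan; the scoring function is a stand-in for the LLM-based plan assessment described in the paper.
```python
# Hedged sketch of breadth-first plan search: expand partial plans step by step
# and keep the highest-scoring complete plan.
from collections import deque

def bfs_plan(candidate_steps, score_fn, max_len=4):
    """Enumerate step sequences breadth-first and return the best-scoring plan.

    candidate_steps: atomic actions (assumed to come from an LLM proposer).
    score_fn: hypothetical scorer, e.g. an LLM-based assessment of a plan.
    """
    best_plan, best_score = [], float("-inf")
    queue = deque([[]])  # start from the empty partial plan
    while queue:
        plan = queue.popleft()
        if len(plan) == max_len:
            s = score_fn(plan)
            if s > best_score:
                best_plan, best_score = plan, s
            continue
        for step in candidate_steps:
            if step not in plan:  # avoid repeating a step
                queue.append(plan + [step])
    return best_plan, best_score

# Toy usage with a stand-in scorer that prefers plans in the original order.
steps = ["crack eggs", "whisk eggs", "heat pan", "pour mixture"]
plan, score = bfs_plan(steps, lambda p: -sum(abs(i - steps.index(a)) for i, a in enumerate(p)))
print(plan, score)
```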
arXiv Detail & Related papers (2024-09-30T17:57:28Z)
- Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows code-form plans -- pseudocode that outlines high-level, structured reasoning processes.
CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks.
It achieves a 25.1% relative improvement compared with directly generating responses.
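Purely as an illustration of the idea (the exact plan syntax used by CodePlan is not reproduced here), a code-form plan can be pictured as a small pseudocode scaffold prepended to the question:
```python
# Illustrative only: what a "code-form plan" might look like as a prompt scaffold.
question = "Ann has 3 boxes with 12 apples each and gives away 7. How many remain?"

code_form_plan = """
def solve():
    total = 3 * 12          # step 1: count all apples
    remaining = total - 7   # step 2: subtract the apples given away
    return remaining
"""

# The plan is prepended to the question so the model follows its control flow
# when producing the final answer.
prompt = f"{question}\n\nPlan:\n{code_form_plan}\nAnswer:"
print(prompt)
```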
arXiv Detail & Related papers (2024-09-19T04:13:58Z)
- EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution [2.2369578015657954]
Task planning for robots in real-life settings presents significant challenges.
These challenges stem from three primary issues: the difficulty in identifying grounded sequences of steps to achieve a goal, the lack of a standardized mapping between high-level actions and low-level commands, and the challenge of maintaining low computational overhead given the limited resources of robotic hardware.
We introduce EMPOWER, a framework designed for open-vocabulary online grounding and planning for embodied agents aimed at addressing these issues.
arXiv Detail & Related papers (2024-08-30T16:15:28Z)
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
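A hedged sketch of one piece of this recipe is shown below: the parsing step that turns free-form CoT text into an environment action whose reward can then drive the RL update. The action set and parsing rule are hypothetical, not the paper's.
```python
# Hedged sketch (not the paper's code): the VLM's chain-of-thought output is
# parsed into a concrete action so that the environment reward can be fed back
# as an RL signal on the generated tokens.
LEGAL_ACTIONS = ["move left", "move right", "pick up", "put down"]  # hypothetical action set

def parse_action(cot_text: str, legal_actions=LEGAL_ACTIONS) -> str:
    """Return the legal action mentioned last in the chain-of-thought text."""
    positions = [(cot_text.rfind(a), a) for a in legal_actions if a in cot_text]
    return max(positions)[1] if positions else legal_actions[0]  # default if none mentioned

# Example CoT such as a VLM might produce from a task description and an image.
cot = ("The target block is on the right shelf, so I should move right first "
       "and then pick up the block.")
print(parse_action(cot))  # -> "pick up"
# A policy-gradient update would then reinforce the CoT tokens in proportion
# to the reward obtained after executing the parsed action.
```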
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
- Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions [5.31484618181979]
Behavior Tree (BT) is an appropriate control architecture for robots executing tasks following human instructions.
This paper proposes a two-stage framework for BT generation, which first employs large language models to interpret goals from high-level instructions.
We represent goals as well-formed formulas in first-order logic, effectively bridging intent understanding and optimal behavior planning.
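A minimal, hypothetical illustration (not the paper's system) of the two ingredients: a goal written as a first-order-logic-style formula and a tiny hand-built behavior tree whose sequence of actions establishes the goal literals.
```python
# Goal: there exists a cup that is clean and on the table, written as a plain string.
goal = "∃c. Cup(c) ∧ Clean(c) ∧ On(c, table)"

class Action:
    """Leaf node: applies its effect to the state and reports success."""
    def __init__(self, name, effect):
        self.name, self.effect = name, effect
    def tick(self, state):
        state.update(self.effect)
        return True

class Sequence:
    """Composite node: succeeds only if every child succeeds, in order."""
    def __init__(self, children):
        self.children = children
    def tick(self, state):
        return all(child.tick(state) for child in self.children)

tree = Sequence([
    Action("find_cup", {"Cup(c)": True}),
    Action("wash_cup", {"Clean(c)": True}),
    Action("place_on_table", {"On(c, table)": True}),
])

state = {}
tree.tick(state)
print(state)  # all goal literals should now hold
```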
arXiv Detail & Related papers (2024-05-13T05:23:48Z)
- Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following [17.608330952846075]
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments.
One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data.
We introduce the Socratic Planner, the first zero-shot planning method for EIF that requires no training data.
arXiv Detail & Related papers (2024-04-21T08:10:20Z)
- Consolidating Trees of Robotic Plans Generated Using Large Language Models to Improve Reliability [6.4111574364474215]
The inherent probabilistic nature of Large Language Models (LLMs) introduces an element of unpredictability.
This paper introduces an innovative approach that aims to generate correct and optimal robotic task plans for diverse real-world demands and scenarios.
arXiv Detail & Related papers (2024-01-15T18:01:59Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observations.
arXiv Detail & Related papers (2023-05-30T09:54:20Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.