Related papers: EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

URL: http://arxiv.org/abs/2305.15021v2
Date: Wed, 13 Sep 2023 23:46:22 GMT
Title: EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Authors: Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo
Abstract summary: Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI. Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
Score: 95.37585041654535
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.

Related papers

Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction [41.63965006043724]
Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user's progress.<n>Recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding.<n>We identify two challenges in training large MLLMs for video-based planning tasks.
arXiv Detail & Related papers (2025-07-20T21:39:05Z)
Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals.<n>We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z)
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning [60.100794160682646]
We propose a new learning framework that jointly optimize state prediction and action selection through preference learning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Our method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B)
arXiv Detail & Related papers (2025-03-13T15:49:56Z)
EXGRA-MED: Extended Context Graph Alignment for Medical Vision- Language Models [69.40730368630003]
We introduce EXGRA-MED, a novel framework for vision-language integration in medical AI.<n>It jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence.<n>It matches LLAVA-MED's performance using just 10% of pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos [48.15438373870542]
VidAssist is an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos. It employs a breadth-first search algorithm for optimal plan generation. Experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups.
arXiv Detail & Related papers (2024-09-30T17:57:28Z)
Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows textcode-form plans -- pseudocode that outlines high-level, structured reasoning processes. CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z)
EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution [2.2369578015657954]
Task planning for robots in real-life settings presents significant challenges. These challenges stem from three primary issues: the difficulty in identifying grounded sequences of steps to achieve a goal, the lack of a standardized mapping between high-level actions and low-level commands, and the challenge of maintaining low computational overhead given the limited resources of robotic hardware. We introduce EMPOWER, a framework designed for open-vocabulary online grounding and planning for embodied agents aimed at addressing these issues.
arXiv Detail & Related papers (2024-08-30T16:15:28Z)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL) Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning. We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions [5.31484618181979]
Behavior Tree (BT) is an appropriate control architecture for robots executing tasks following human instructions. This paper proposes a two-stage framework for BT generation, which first employs large language models to interpret goals from high-level instructions. We represent goals as well-formed formulas in first-order logic, effectively bridging intent understanding and optimal behavior planning.
arXiv Detail & Related papers (2024-05-13T05:23:48Z)
Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following [17.608330952846075]
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data. We introduce the Socratic Planner, the first zero-shot planning method that infers without the need for any training data.
arXiv Detail & Related papers (2024-04-21T08:10:20Z)
Consolidating Trees of Robotic Plans Generated Using Large Language Models to Improve Reliability [6.4111574364474215]
The inherent probabilistic nature of Large Language Models (LLMs) introduces an element of unpredictability. This paper introduces an innovative approach aims to generate correct and optimal robotic task plans for diverse real-world demands and scenarios.
arXiv Detail & Related papers (2024-01-15T18:01:59Z)
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios. We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks. The resulting dataset AlphaBlock consists of 35 comprehensive high-level tasks of multi-step text plans and paired observation.
arXiv Detail & Related papers (2023-05-30T09:54:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.