Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation
- URL: http://arxiv.org/abs/2405.17418v1
- Date: Mon, 27 May 2024 17:58:48 GMT
- Title: Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation
- Authors: Jiaming Liu, Chenxuan Li, Guanqun Wang, Lily Lee, Kaichen Zhou, Sixiang Chen, Chuyan Xiong, Jiaxin Ge, Renrui Zhang, Shanghang Zhang,
- Abstract summary: Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following.
We introduce a Self-Corrected (SC)-MLLM, equipping our model not only to predict end-effector poses but also to autonomously recognize and correct failure actions.
SC-MLLM agent significantly improve manipulation accuracy compared to previous state-of-the-art robotic MLLM (ManipLLM)
- Score: 30.54275273155153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robot manipulation policies have shown unsatisfactory action performance when confronted with novel task or object instances. Hence, the capability to automatically detect and self-correct failure action is essential for a practical robotic system. Recently, Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following and demonstrated strong reasoning abilities in various tasks. To unleash general MLLMs as an end-to-end robotic agent, we introduce a Self-Corrected (SC)-MLLM, equipping our model not only to predict end-effector poses but also to autonomously recognize and correct failure actions. Specifically, we first conduct parameter-efficient fine-tuning to empower MLLM with pose prediction ability, which is reframed as a language modeling problem. When facing execution failures, our model learns to identify low-level action error causes (i.e., position and rotation errors) and adaptively seeks prompt feedback from experts. Based on the feedback, SC-MLLM rethinks the current failure scene and generates the corrected actions. Furthermore, we design a continuous policy learning method for successfully corrected samples, enhancing the model's adaptability to the current scene configuration and reducing the frequency of expert intervention. To evaluate our SC-MLLM, we conduct extensive experiments in both simulation and real-world settings. SC-MLLM agent significantly improve manipulation accuracy compared to previous state-of-the-art robotic MLLM (ManipLLM), increasing from 57\% to 79\% on seen object categories and from 47\% to 69\% on unseen novel categories.
Related papers
- MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [52.739500459903724]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation.
We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents.
We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning [19.023560632891467]
We propose a scalable data generation pipeline that augments expert demonstrations with failure recovery trajectories.
We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework.
Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer on RLbench.
arXiv Detail & Related papers (2024-09-23T02:50:33Z) - AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation [10.565956950284896]
The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects.
Previous approaches have aimed to utilize Multimodal Large Language Models to enhance robotic systems.
We propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions for articulated object.
arXiv Detail & Related papers (2024-06-17T13:44:53Z) - Verbalized Machine Learning: Revisiting Machine Learning with Language Models [63.10391314749408]
We introduce the framework of verbalized machine learning (VML)
VML constrains the parameter space to be human-interpretable natural language.
We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability.
arXiv Detail & Related papers (2024-06-06T17:59:56Z) - Evaluating Uncertainty-based Failure Detection for Closed-Loop LLM Planners [10.746821861109176]
Large Language Models (LLMs) have witnessed remarkable performance as zero-shot task planners for robotic tasks.
However, the open-loop nature of previous works makes LLM-based planning error-prone and fragile.
In this work, we introduce a framework for closed-loop LLM-based planning called KnowLoop, backed by an uncertainty-based MLLMs failure detector.
arXiv Detail & Related papers (2024-06-01T12:52:06Z) - ManipLLM: Embodied Multimodal Large Language Model for Object-Centric
Robotic Manipulation [22.071450379253235]
We introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs)
By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation.
Experiments in simulator and real-world show the promising performance of ManipLLM.
arXiv Detail & Related papers (2023-12-24T06:38:11Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.