RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation
- URL: http://arxiv.org/abs/2406.04339v1
- Date: Thu, 6 Jun 2024 17:59:47 GMT
- Title: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation
- Authors: Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Lily Lee, Kaichen Zhou, Pengju An, Senqiao Yang, Renrui Zhang, Yandong Guo, Shanghang Zhang
- Abstract summary: We introduce RoboMamba, an end-to-end robotic MLLM that delivers both robotic reasoning and action capabilities.
Specifically, we first integrate the vision encoder with Mamba, aligning visual data with language embedding through co-training.
We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters.
- Score: 38.89586890052952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing robot Multimodal Large Language Models (MLLMs) can handle a range of basic tasks, they still face challenges in two areas: 1) inadequate reasoning ability to tackle complex tasks, and 2) high computational costs for MLLM fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic MLLM that leverages the Mamba model to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual data with language embedding through co-training, empowering our model with visual common sense and robot-related reasoning. To further equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time (20 minutes). In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 7 times faster than existing robot MLLMs. Our project web page: https://sites.google.com/view/robomamba-web
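The abstract describes a concrete recipe: align a vision encoder with a Mamba language backbone through a projector, then fine-tune only a small policy head for pose prediction. The PyTorch-style sketch below illustrates that pattern under stated assumptions; the module choices, dimensions, and names are illustrative (the sequence model is a placeholder rather than an actual Mamba block), and this is not the authors' released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the RoboMamba-style setup described in the abstract.
# All module choices and dimensions are assumptions for illustration;
# the GRU stands in for a pretrained Mamba backbone.
class RoboMambaSketch(nn.Module):
    def __init__(self, vis_dim=1024, lm_dim=2048, action_dim=7):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)        # stand-in for a pretrained ViT
        self.projector = nn.Linear(vis_dim, lm_dim)              # aligns visual tokens to the language space
        self.backbone = nn.GRU(lm_dim, lm_dim, batch_first=True) # placeholder for the Mamba language model
        self.policy_head = nn.Sequential(                        # small head for end-effector pose
            nn.Linear(lm_dim, 256), nn.GELU(), nn.Linear(256, action_dim)
        )

    def forward(self, image_tokens, text_embeds):
        vis = self.projector(self.vision_encoder(image_tokens))  # [B, N_img, lm_dim]
        seq = torch.cat([vis, text_embeds], dim=1)                # multimodal token sequence
        hidden, _ = self.backbone(seq)
        return self.policy_head(hidden[:, -1])                    # predict a pose from the last token

model = RoboMambaSketch()

# Freeze everything except the policy head, mirroring the paper's idea of
# fine-tuning only a tiny fraction of the model for manipulation.
for name, p in model.named_parameters():
    p.requires_grad = "policy_head" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```

The point of the sketch is the training setup rather than the modules: during manipulation fine-tuning only the policy head receives gradients, which is what keeps the trainable parameter count and training time small.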
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people, and its ability to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
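Since the $π_0$ entry above centers on flow matching for action generation, here is a minimal generic flow-matching training step for an action head conditioned on vision-language features, included as a reading aid. The network, dimensions, and conditioning are assumptions for illustration and not the $π_0$ implementation.

```python
import torch
import torch.nn as nn

# Generic flow-matching step for action prediction (illustrative assumptions only):
# learn a velocity field v(a_t, t, context) that transports Gaussian noise to actions.
action_dim, ctx_dim = 7, 512
velocity_net = nn.Sequential(
    nn.Linear(action_dim + ctx_dim + 1, 256), nn.GELU(), nn.Linear(256, action_dim)
)
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)

def flow_matching_step(actions, context):
    """actions: [B, action_dim] expert actions; context: [B, ctx_dim], e.g. VLM features."""
    noise = torch.randn_like(actions)                 # sample from the source distribution
    t = torch.rand(actions.size(0), 1)                # random time in [0, 1]
    a_t = (1 - t) * noise + t * actions               # linear interpolant between noise and data
    target_v = actions - noise                        # velocity of the straight-line path
    pred_v = velocity_net(torch.cat([a_t, context, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Example usage with random stand-in data.
loss = flow_matching_step(torch.randn(16, action_dim), torch.randn(16, ctx_dim))
```

At inference time, actions would be generated by integrating the learned velocity field from noise to t = 1, for example with a few Euler steps.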
- MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics? [33.573056018368504]
This study introduces MMRo, the first benchmark for evaluating Multimodal LLMs for robotic applications.
We identify four essential capabilities (perception, task planning, visual reasoning, and safety measurement) that MLLMs must possess to qualify as the robot's central processing unit.
Our findings indicate that no single model excels in all areas, suggesting that current MLLMs are not yet trustworthy enough to serve as the cognitive core for robots.
arXiv Detail & Related papers (2024-06-28T07:09:06Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [46.63773228934993]
We introduce an automatic synthetic data generation pipeline that instruction-tunes vision language models (VLMs) to robotic domains and needs.
Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions.
Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks.
arXiv Detail & Related papers (2024-06-15T19:22:51Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation [33.10577695383743]
We propose a multi-embodiment, multi-task generalist agent for robotic manipulation called RoboCat.
Its training data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions.
With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100-1000 examples.
arXiv Detail & Related papers (2023-06-20T17:35:20Z)
- Surfer: Progressive Reasoning with World Models for Robotic Manipulation [51.26109827779267]
We introduce a novel and simple robot manipulation framework, called Surfer.
Built on a world model, Surfer treats robot manipulation as a state transfer of the visual scene and decouples it into two parts: action and scene.
arXiv Detail & Related papers (2023-06-20T07:06:04Z)
- VIMA: General Robot Manipulation with Multimodal Prompts [82.01214865117637]
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts.
We develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks.
We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.
arXiv Detail & Related papers (2022-10-06T17:50:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.