ManipLLM: Embodied Multimodal Large Language Model for Object-Centric
Robotic Manipulation
- URL: http://arxiv.org/abs/2312.16217v1
- Date: Sun, 24 Dec 2023 06:38:11 GMT
- Title: ManipLLM: Embodied Multimodal Large Language Model for Object-Centric
Robotic Manipulation
- Authors: Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan
Shen, Renrui Zhang, Jiaming Liu, Hao Dong
- Abstract summary: We introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs).
By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability to manipulate.
Experiments in simulation and the real world show the promising performance of ManipLLM.
- Score: 22.071450379253235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robot manipulation relies on accurately predicting contact points and
end-effector directions to ensure successful operation. However, learning-based
robot manipulation, trained on a limited set of categories within a simulator,
often struggles to generalize, especially when confronted with a wide range of
categories. Therefore, we introduce an innovative approach for robot
manipulation that leverages the robust reasoning capabilities of Multimodal
Large Language Models (MLLMs) to enhance the stability and generalization of
manipulation. By fine-tuning the injected adapters, we preserve the inherent
common sense and reasoning ability of the MLLMs while equipping them with the
ability to manipulate. The fundamental insight lies in the introduced
fine-tuning paradigm, encompassing object category understanding, affordance
prior reasoning, and object-centric pose prediction, which together stimulate
the reasoning ability of the MLLM in manipulation. During inference, our
approach uses an RGB image and a text prompt to predict the end-effector's pose
in a chain-of-thought manner. After the initial contact is established, an
active impedance adaptation policy is introduced to plan the upcoming waypoints
in a closed-loop manner. Moreover, in the real world, we design a test-time
adaptation (TTA) strategy for manipulation to enable the model to better adapt
to the current real-world scene configuration. Experiments in simulation and
the real world show the promising performance of ManipLLM. More details and
demonstrations can be found at
https://sites.google.com/view/manipllm.
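
The abstract's fine-tuning strategy keeps the pretrained MLLM frozen and trains only lightweight injected adapters. The snippet below is a minimal sketch of that idea using a LoRA-style low-rank adapter wrapped around frozen linear layers in PyTorch; the adapter type, rank, insertion points, and the `load_pretrained_mllm` loader are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of adapter-only fine-tuning: the backbone stays frozen and
# only the small injected adapter matrices receive gradients.
# Assumptions: LoRA-style adapters and PyTorch; the paper's actual adapter
# design and insertion points may differ.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def inject_adapters(model: nn.Module, rank: int = 8):
    """Replace every nn.Linear in the frozen MLLM with a LoRA-wrapped copy."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, rank))
        else:
            inject_adapters(child, rank)
    return model

# Usage (hypothetical loader): only adapter parameters go to the optimizer.
# mllm = inject_adapters(load_pretrained_mllm())
# trainable = [p for p in mllm.parameters() if p.requires_grad]
# optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```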
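At inference, the abstract describes feeding an RGB image and a text prompt to the model and obtaining the end-effector pose through chain-of-thought reasoning (category understanding, affordance reasoning, then pose prediction). The sketch below shows one way such a staged query-and-parse loop could look; the prompt wording, the numeric answer format, and the `query_mllm` wrapper are assumptions for illustration rather than the paper's actual interface.

```python
# Sketch of chain-of-thought inference: query the fine-tuned MLLM step by step
# and parse the final answer into a contact pixel plus gripper directions.
# Assumptions: `query_mllm(image, prompt) -> str` is a hypothetical wrapper;
# the real prompt templates and output format may differ.
import re

COT_PROMPTS = [
    "What category does the object in the image belong to?",
    "Which region of the object affords the requested manipulation?",
    "Specify the contact point (u, v) in pixels and the gripper's "
    "upward and forward directions as unit vectors.",
]

def predict_pose(image, query_mllm):
    context, answers = [], []
    for prompt in COT_PROMPTS:
        answer = query_mllm(image, " ".join(context + [prompt]))
        context.append(f"{prompt} {answer}")
        answers.append(answer)
    # Expected (assumed) final answer, e.g.:
    # "contact: (312, 245), up: (0.0, 0.0, 1.0), forward: (1.0, 0.0, 0.0)"
    nums = [float(x) for x in re.findall(r"-?\d+\.?\d*", answers[-1])]
    contact_uv, up, forward = nums[:2], nums[2:5], nums[5:8]
    return contact_uv, up, forward
```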
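After the initial contact, the abstract mentions an active impedance adaptation policy that plans the upcoming waypoints in a closed loop, but gives no algorithmic detail. The sketch below is only a rough stand-in for that idea: command a small step, measure the motion the mechanism actually allowed, and re-align the next waypoint with it. Every robot-interface call and parameter here is a hypothetical placeholder.

```python
# Very rough closed-loop waypoint sketch: follow the direction the articulated
# part actually permits, one small step at a time. The robot API
# (get_ee_position, move_ee_delta) is a hypothetical placeholder; the paper's
# policy is not specified in the abstract.
import numpy as np

def closed_loop_waypoints(robot, init_dir, n_steps=10, step_len=0.02):
    direction = np.asarray(init_dir, dtype=float)
    direction /= np.linalg.norm(direction)
    for _ in range(n_steps):
        start = robot.get_ee_position()            # hypothetical robot API
        robot.move_ee_delta(step_len * direction)  # commanded small motion
        moved = robot.get_ee_position() - start    # motion the mechanism allowed
        if np.linalg.norm(moved) > 1e-4:
            # Re-align the next waypoint with the observed compliant motion.
            direction = moved / np.linalg.norm(moved)
    return direction
```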
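The abstract also mentions a test-time adaptation (TTA) strategy for the real world without describing its objective. As a generic illustration only, the snippet below applies entropy minimization over the adapter parameters on incoming real-world observations; the paper's actual TTA objective for manipulation may be quite different.

```python
# Generic test-time adaptation sketch (entropy minimization over the adapter
# parameters only). This is a stand-in, not the paper's method: the forward
# call `mllm(image, prompt)` returning logits is a hypothetical interface.
import torch
import torch.nn.functional as F

def tta_step(mllm, optimizer, image, prompt):
    """One adaptation step on a single real-world observation."""
    logits = mllm(image, prompt)                  # hypothetical forward pass
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()                            # gradients reach only the
    optimizer.step()                              # adapters; the rest is frozen
    return entropy.item()
```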
Related papers
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation [82.61572106180705]
This paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories.
We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data.
Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates.
arXiv Detail & Related papers (2024-09-26T17:26:16Z)
- Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation [30.54275273155153]
Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following.
We introduce a Self-Corrected (SC)-MLLM, equipping our model not only to predict end-effector poses but also to autonomously recognize and correct failure actions.
The SC-MLLM agent significantly improves manipulation accuracy compared to the previous state-of-the-art robotic MLLM (ManipLLM).
arXiv Detail & Related papers (2024-05-27T17:58:48Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Programmatically Grounded, Compositionally Generalizable Robotic Manipulation [35.12811184353626]
We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information.
We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions.
Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
arXiv Detail & Related papers (2023-04-26T20:56:40Z)
- Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform.
We produce a closed-loop controller to reactively push objects in a continuous action space.
We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z)