SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation
- URL: http://arxiv.org/abs/2405.19586v1
- Date: Thu, 30 May 2024 00:32:51 GMT
- Title: SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation
- Authors: Junjie Zhang, Chenjia Bai, Haoran He, Wenke Xia, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li,
- Abstract summary: Segment Anything (SAM) is a vision-foundation model for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
- Score: 62.58480650443393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
Related papers
- Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions.
By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories.
arXiv Detail & Related papers (2024-11-19T01:23:52Z) - Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting challenges through two key innovations.
A novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
An agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language--action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z) - Masked AutoDecoder is Effective Multi-Task Vision Generalist [64.43215311406195]
Masked AutoDecoder(MAD) is an effective multi-task vision generalist.
We develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies.
Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences.
arXiv Detail & Related papers (2024-03-12T14:36:52Z) - Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D
Action Representation Learning [33.68311764817763]
We propose Prompted Contrast with Masked Motion Modeling, PCM$rm 3$, for versatile 3D action representation learning.
Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner.
Tests on five downstream tasks under three large-scale datasets are conducted, demonstrating the superior generalization capacity of PCM$rm 3$ compared to the state-of-the-art works.
arXiv Detail & Related papers (2023-08-08T01:27:55Z) - AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot
Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset AlphaBlock consists of 35 comprehensive high-level tasks of multi-step text plans and paired observation.
arXiv Detail & Related papers (2023-05-30T09:54:20Z) - PACT: Perception-Action Causal Transformer for Autoregressive Robotics
Pre-Training [25.50131893785007]
This work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot.
We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion.
We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously.
arXiv Detail & Related papers (2022-09-22T16:20:17Z) - Efficient and Robust Training of Dense Object Nets for Multi-Object
Robot Manipulation [8.321536457963655]
We propose a framework for robust and efficient training of Dense Object Nets (DON)
We focus on training with multi-object data instead of singulated objects, combined with a well-chosen augmentation scheme.
We demonstrate the robustness and accuracy of our proposed framework on a real-world robotic grasping task.
arXiv Detail & Related papers (2022-06-24T08:24:42Z) - Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill
Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.