SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation
- URL: http://arxiv.org/abs/2405.19586v1
- Date: Thu, 30 May 2024 00:32:51 GMT
- Title: SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation
- Authors: Junjie Zhang, Chenjia Bai, Haoran He, Wenke Xia, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li
- Abstract summary: SAM-E leverages Segment Anything (SAM), a vision foundation model, for generalizable scene understanding, combined with sequence imitation for long-term action reasoning.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
- Score: 62.58480650443393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
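To make the action-sequence idea concrete, the following is a minimal sketch (assuming PyTorch and features from a SAM-style image encoder fine-tuned with parameter-efficient adapters) of a decoder head that emits one translation heatmap per future timestep, so that an entire action sequence is predicted in a single forward pass. The names `ActionSequenceHead`, `horizon`, and the auxiliary rotation/gripper head are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: a multi-channel heatmap head that predicts a whole action
# sequence in one pass. Class/parameter names are illustrative assumptions.
import torch
import torch.nn as nn


class ActionSequenceHead(nn.Module):
    """Decode per-timestep heatmaps for a horizon of future end-effector poses."""

    def __init__(self, feat_dim: int = 256, horizon: int = 4):
        super().__init__()
        self.horizon = horizon
        # One output channel per future timestep -> a "multi-channel heatmap".
        self.heatmap_head = nn.Sequential(
            nn.Conv2d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, horizon, kernel_size=1),
        )
        # Low-dimensional action components (e.g. rotation quaternion + gripper) per step.
        self.aux_head = nn.Linear(feat_dim, horizon * (4 + 1))

    def forward(self, feats: torch.Tensor):
        # feats: (B, feat_dim, H, W) visual features from the fine-tuned encoder.
        heatmaps = self.heatmap_head(feats)          # (B, horizon, H, W)
        pooled = feats.mean(dim=(2, 3))              # (B, feat_dim) global context
        aux = self.aux_head(pooled).view(-1, self.horizon, 5)
        return heatmaps, aux


# Toy usage: the argmax of each heatmap channel gives the 2D keypoint for that
# future step; with depth / multiple views this can be lifted to a 3D position.
feats = torch.randn(2, 256, 32, 32)
heatmaps, aux = ActionSequenceHead()(feats)
flat = heatmaps.flatten(2).argmax(dim=-1)            # (B, horizon) flat pixel indices
row = flat.div(32, rounding_mode="floor")
col = flat % 32
keypoints = torch.stack((row, col), dim=-1)          # (B, horizon, 2) row/col
```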
Related papers
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting challenges through two key innovations.
A novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
An agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z) - Masked AutoDecoder is Effective Multi-Task Vision Generalist [64.43215311406195]
Masked AutoDecoder (MAD) is an effective multi-task vision generalist.
First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies.
Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences (see the sketch after this list).
arXiv Detail & Related papers (2024-03-12T14:36:52Z) - Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning [33.68311764817763]
We propose Prompted Contrast with Masked Motion Modeling, PCM$^{\rm 3}$, for versatile 3D action representation learning.
Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner.
Tests on five downstream tasks across three large-scale datasets demonstrate the superior generalization capacity of PCM$^{\rm 3}$ compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-08-08T01:27:55Z) - AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observations.
arXiv Detail & Related papers (2023-05-30T09:54:20Z) - PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [25.50131893785007]
This work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot.
We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion.
We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously.
arXiv Detail & Related papers (2022-09-22T16:20:17Z) - Efficient and Robust Training of Dense Object Nets for Multi-Object Robot Manipulation [8.321536457963655]
We propose a framework for robust and efficient training of Dense Object Nets (DON).
We focus on training with multi-object data instead of singulated objects, combined with a well-chosen augmentation scheme.
We demonstrate the robustness and accuracy of our proposed framework on a real-world robotic grasping task.
arXiv Detail & Related papers (2022-06-24T08:24:42Z) - Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)
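As noted in the Masked AutoDecoder entry above, masked sequence modeling can be illustrated with a short, generic sketch: mask a fraction of task-sequence tokens and train a bi-directional transformer to reconstruct them. This is an assumption-laden illustration of the general technique, not MAD's actual architecture; all names and hyperparameters below are illustrative.

```python
# Generic masked sequence modeling sketch (not MAD's exact recipe): randomly
# replace a fraction of task tokens with a mask token and reconstruct them.
import torch
import torch.nn as nn

vocab_size, mask_id, d_model = 1000, 0, 256
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=4)   # bi-directional attention
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(1, vocab_size, (8, 32))         # (batch, seq) task sequences
mask = torch.rand(tokens.shape) < 0.3                  # mask ~30% of positions
corrupted = tokens.masked_fill(mask, mask_id)

logits = head(decoder(embed(corrupted)))               # (batch, seq, vocab)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # masked positions only
loss.backward()
```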
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.