M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place
- URL: http://arxiv.org/abs/2311.00926v1
- Date: Thu, 2 Nov 2023 01:42:52 GMT
- Title: M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place
- Authors: Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox
- Abstract summary: M2T2 is a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes.
M2T2 is trained on a large-scale synthetic dataset with 128K scenes and achieves zero-shot sim2real transfer on a real robot.
- Score: 44.303123422422246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of large language models and large-scale robotic datasets,
there has been tremendous progress in high-level decision-making for object
manipulation. These generic models are able to interpret complex tasks using
language commands, but they often have difficulties generalizing to
out-of-distribution objects because the low-level action primitives they rely
on cannot handle such objects. In contrast, existing task-specific models excel in low-level
manipulation of unknown objects, but only work for a single type of action. To
bridge this gap, we present M2T2, a single model that supplies different types
of low-level actions that work robustly on arbitrary objects in cluttered
scenes. M2T2 is a transformer model which reasons about contact points and
predicts valid gripper poses for different action modes given a raw point cloud
of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2
achieves zero-shot sim2real transfer on a real robot, outperforming a
baseline system built from state-of-the-art task-specific models by about 19% in
overall performance and by 37.5% in challenging scenes where the object needs to
be re-oriented for collision-free placement. M2T2 also achieves
state-of-the-art results on a subset of language-conditioned tasks in RLBench.
Videos of robot experiments on unseen objects in both real world and simulation
are available on our project website https://m2-t2.github.io.
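The abstract characterizes M2T2 as a transformer that reasons about contact points and predicts valid gripper poses for multiple action modes from a raw scene point cloud. As a rough, non-authoritative illustration of what such an interface could look like, the sketch below assumes a generic point encoder, learned per-mode queries, and heads for 6-DoF poses, per-point contact masks, and confidence scores; all class names, dimensions, and design choices are assumptions and do not reproduce the paper's architecture.
```python
# Illustrative sketch only: a minimal PyTorch model with the *kind* of
# interface the abstract describes (raw scene point cloud in, per-mode
# contact masks and 6-DoF gripper pose proposals out). The architecture,
# names, and dimensions below are assumptions, not M2T2's implementation.
import torch
import torch.nn as nn


class PickPlacePoseModel(nn.Module):
    def __init__(self, num_modes: int = 2, d_model: int = 256, num_queries: int = 16):
        super().__init__()
        # Stand-in for a point-cloud feature backbone operating on raw points.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # One learned query set per action mode (e.g. pick vs. place).
        self.mode_queries = nn.Parameter(torch.randn(num_modes, num_queries, d_model))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.pose_head = nn.Linear(d_model, 7)  # translation (3) + quaternion (4)
        self.conf_head = nn.Linear(d_model, 1)  # validity score per proposal

    def forward(self, points: torch.Tensor, mode: int):
        """points: (B, N, 3) raw scene point cloud; mode: index of the action mode."""
        feats = self.point_encoder(points)                              # (B, N, D)
        queries = self.mode_queries[mode].unsqueeze(0).expand(points.size(0), -1, -1)
        decoded = self.decoder(queries, feats)                          # (B, Q, D)
        poses = self.pose_head(decoded)                                 # (B, Q, 7)
        contact_masks = torch.sigmoid(decoded @ feats.transpose(1, 2))  # (B, Q, N)
        confidence = torch.sigmoid(self.conf_head(decoded)).squeeze(-1) # (B, Q)
        return poses, contact_masks, confidence


if __name__ == "__main__":
    model = PickPlacePoseModel()
    cloud = torch.rand(1, 2048, 3)               # dummy scene point cloud
    poses, masks, conf = model(cloud, mode=0)
    print(poses.shape, masks.shape, conf.shape)  # (1, 16, 7) (1, 16, 2048) (1, 16)
```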
Related papers
- Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning [15.03025428687218]
The state of an object reflects its current status or condition and is important for a robot's task planning and manipulation.
Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans.
We introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks.
arXiv Detail & Related papers (2024-06-14T12:52:42Z) - Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions [8.059133373836913]
This paper presents an approach that enables a robot to rapidly learn the complete 3D model of a given object for manipulation in unfamiliar orientations.
We use an ensemble of partially constructed NeRF models to quantify model uncertainty to determine the next action.
Our approach determines when and how to grasp and re-orient an object given its partial NeRF model and re-estimates the object pose to rectify misalignments introduced during the interaction.
arXiv Detail & Related papers (2024-04-02T10:15:06Z) - TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open-vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - GAMMA: Generalizable Articulation Modeling and Manipulation for
Articulated Objects [53.965581080954905]
We propose a novel framework, Generalizable Articulation Modeling and Manipulation for Articulated Objects (GAMMA).
GAMMA learns both articulation modeling and grasp pose affordance from diverse articulated objects with different categories.
Results show that GAMMA significantly outperforms SOTA articulation modeling and manipulation algorithms on unseen and cross-category articulated objects.
arXiv Detail & Related papers (2023-09-28T08:57:14Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - ARMBench: An Object-centric Benchmark Dataset for Robotic Manipulation [9.551453254490125]
ARMBench is a large-scale, object-centric benchmark dataset for robotic manipulation in the context of a warehouse.
We present a large-scale dataset collected in an Amazon warehouse using a robotic manipulator performing object singulation.
arXiv Detail & Related papers (2023-03-29T01:42:54Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic
Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.