ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills
- URL: http://arxiv.org/abs/2302.04659v1
- Date: Thu, 9 Feb 2023 14:24:01 GMT
- Title: ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills
- Authors: Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou
Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie,
Zhiao Huang, Rui Chen, Hao Su
- Abstract summary: We present ManiSkill2, the next generation of the SAPIEN ManiSkill benchmark for generalizable manipulation skills.
ManiSkill2 includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames.
It defines a unified interface and evaluation protocol to support a wide range of algorithms.
It enables fast learning from visual inputs, so a CNN-based policy can collect samples at about 2000 FPS.
- Score: 24.150758623016195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalizable manipulation skills, which can be composed to tackle
long-horizon and complex daily chores, are one of the cornerstones of Embodied
AI. However, existing benchmarks, mostly composed of a suite of simulatable
environments, are insufficient to push cutting-edge research because they
lack object-level topological and geometric variations, are not based on fully
dynamic simulation, or are short of native support for multiple types of
manipulation tasks. To this end, we present ManiSkill2, the next generation of
the SAPIEN ManiSkill benchmark, to address critical pain points often
encountered by researchers when using benchmarks for generalizable manipulation
skills. ManiSkill2 includes 20 manipulation task families with 2000+ object
models and 4M+ demonstration frames, which cover stationary/mobile-base,
single/dual-arm, and rigid/soft-body manipulation tasks with 2D/3D-input data
simulated by fully dynamic engines. It defines a unified interface and
evaluation protocol to support a wide range of algorithms (e.g., classic
sense-plan-act, RL, IL), visual observations (point cloud, RGBD), and
controllers (e.g., action type and parameterization). Moreover, it enables fast
learning from visual inputs: a CNN-based policy can collect samples at about
2000 FPS with 1 GPU and 16 processes on a regular workstation.
It implements a render server infrastructure to allow sharing rendering
resources across all environments, thereby significantly reducing memory usage.
We open-source all code of our benchmark (simulator, environments, and
baselines) and host an online challenge open to interdisciplinary researchers.
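Below is a minimal sketch of how the unified interface described in the abstract is typically used, assuming the mani_skill2 Python package and its gym-style environment registration as documented by the authors; the task id "PickCube-v0" and the obs_mode/control_mode/reward_mode values are illustrative choices drawn from the documentation, not an exhaustive list, and should be checked against the open-sourced code.

    # Sketch: build one ManiSkill2 task and step it with a placeholder policy.
    # Names below (env id, observation/control/reward modes) follow the ManiSkill2
    # docs as recalled here; verify against the released code before relying on them.
    import gym
    import mani_skill2.envs  # noqa: F401  (importing registers the environments)

    env = gym.make(
        "PickCube-v0",                    # one of the 20 manipulation task families
        obs_mode="rgbd",                  # point cloud and state-based modes also exist
        control_mode="pd_ee_delta_pose",  # controller choice = action parameterization
        reward_mode="dense",
    )

    obs = env.reset()
    for _ in range(200):                    # episodes are time-limited by the benchmark
        action = env.action_space.sample()  # stand-in for a sense-plan-act/RL/IL policy
        obs, reward, done, info = env.step(action)
        if done:                            # task success is reported through the info dict
            break
    env.close()

Note that the roughly 2000 FPS sample-collection figure quoted in the abstract refers to the benchmark's render-server-backed, multi-process setup rather than the single-environment loop sketched above.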
Related papers
- ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI [27.00155119759743]
ManiSkill3 is the fastest state-visual GPU parallelized robotics simulator with contact-rich physics targeting generalizable manipulation.
ManiSkill3 supports GPU parallelization of many aspects including simulation+rendering, heterogeneous simulation, pointclouds/voxels visual input, and more.
arXiv Detail & Related papers (2024-10-01T06:10:39Z)
- M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place [44.303123422422246]
M2T2 is a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes.
M2T2 is trained on a large-scale synthetic dataset with 128K scenes and achieves zero-shot sim2real transfer on the real robot.
arXiv Detail & Related papers (2023-11-02T01:42:52Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- OTOV2: Automatic, Generic, User-Friendly [39.828644638174225]
We propose the second generation of Only-Train-Once (OTOv2), which first automatically trains and compresses a general DNN only once from scratch.
OTOv2 is automatic and pluggable into various deep learning applications, and requires almost minimal engineering efforts from the users.
Numerically, we demonstrate the generality and autonomy of OTOv2 on a variety of model architectures such as VGG, ResNet, CARN, ConvNeXt, DenseNet and StackedUnets.
arXiv Detail & Related papers (2023-03-13T05:13:47Z)
- ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [55.485985317538194]
ProcTHOR is a framework for procedural generation of Embodied AI environments.
We demonstrate state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
arXiv Detail & Related papers (2022-06-14T17:09:35Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
- Habitat 2.0: Training Home Assistants to Rearrange their Habitat [122.54624752876276]
We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments.
We make contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks.
arXiv Detail & Related papers (2021-06-28T05:42:15Z)
- Fast Object Segmentation Learning with Kernel-based Methods for Robotics [21.48920421574167]
Object segmentation is a key component in the visual system of a robot that performs tasks like grasping and object manipulation.
We propose a novel architecture for object segmentation that overcomes this problem and provides comparable performance in a fraction of the time required by state-of-the-art methods.
Our approach is validated on the YCB-Video dataset which is widely adopted in the computer vision and robotics community.
arXiv Detail & Related papers (2020-11-25T15:07:39Z)
- Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation [115.4071729927011]
We study the effects of using mid-level visual representations as generic and easy-to-decode perceptual state in an end-to-end RL framework.
We show that they aid generalization, improve sample complexity, and lead to a higher final performance.
In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed.
arXiv Detail & Related papers (2020-11-13T00:16:05Z)
- Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.