MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic
Grasping via Physics-based Metaverse Synthesis
- URL: http://arxiv.org/abs/2112.14663v2
- Date: Thu, 30 Dec 2021 18:05:26 GMT
- Authors: Yuhao Chen, E. Zhixuan Zeng, Maximilian Gilles, Alexander Wong
- Abstract summary: We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been increasing interest in smart factories powered by robotics
systems to tackle repetitive, laborious tasks. One impactful yet challenging
task in robotics-powered smart factory applications is robotic grasping: using
robotic arms to grasp objects autonomously in different settings. Robotic
grasping requires a variety of computer vision tasks such as object detection,
segmentation, grasp prediction, pick planning, etc. While significant progress
has been made in leveraging machine learning for robotic grasping,
particularly with deep learning, a major challenge remains: the need for
large-scale, high-quality RGBD datasets that cover a wide diversity of
scenarios and permutations. To tackle this big, diverse data problem, we take
inspiration from the recent rise of the metaverse concept, which has greatly
closed the gap between virtual worlds and the physical world. Metaverses allow
us to create digital twins of real-world manufacturing scenarios and to
virtually create different scenarios from which large volumes of data can be
generated for training models. In this paper, we present MetaGraspNet: a
large-scale benchmark dataset for vision-driven robotic grasping via
physics-based metaverse synthesis. The proposed dataset contains 100,000 images
and 25 different object types, and is split into 5 difficulty levels to
evaluate object detection and segmentation model performance in different
grasping scenarios. We also propose a new layout-weighted performance metric
alongside the dataset for evaluating object detection and segmentation
performance in a manner more appropriate for robotic grasping applications
than existing general-purpose performance metrics. Our benchmark dataset is
available open-source on Kaggle, with the first phase consisting of detailed
object detection, segmentation, layout annotations, and a layout-weighted
performance metric script.
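The abstract names a layout-weighted metric but does not give its formula; the authoritative definition is in the metric script released with the dataset. As a rough illustration of the idea, one could combine per-difficulty-level AP scores with weights that emphasize harder layouts. The level count, weights, and function name below are illustrative assumptions only, not the paper's actual scheme:

```python
# Hypothetical sketch of a layout-weighted detection metric: combine
# per-difficulty-level AP scores into one score, weighting harder layouts
# more heavily. Weights and level numbering are illustrative assumptions.

def layout_weighted_ap(ap_per_level, weights=None):
    """Combine per-difficulty AP scores into a single layout-weighted score.

    ap_per_level: dict mapping difficulty level (e.g. 1..5) -> AP in [0, 1]
    weights:      optional dict with the same keys; defaults to uniform.
    """
    if weights is None:
        weights = {level: 1.0 for level in ap_per_level}
    total = sum(weights[level] for level in ap_per_level)
    return sum(ap_per_level[level] * weights[level]
               for level in ap_per_level) / total

# Example: 5 difficulty levels, with harder layouts weighted more heavily.
ap = {1: 0.90, 2: 0.85, 3: 0.80, 4: 0.70, 5: 0.60}
w = {1: 1.0, 2: 1.0, 3: 1.5, 4: 2.0, 5: 2.5}
score = layout_weighted_ap(ap, w)  # emphasizes levels 4 and 5
```

With uniform weights this reduces to a plain mean over difficulty levels; the weighted form penalizes models whose performance collapses on cluttered layouts, which matters more in bin-picking settings than aggregate AP does.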
Related papers
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [57.40024206484446]
We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models.
BVS supports a large number of adjustable parameters at the scene level.
We showcase three example application scenarios.
arXiv Detail & Related papers (2024-05-15T17:57:56Z)
- ICGNet: A Unified Approach for Instance-Centric Grasping [42.92991092305974]
We introduce an end-to-end architecture for object-centric grasping.
We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets.
arXiv Detail & Related papers (2024-01-18T12:41:41Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- HabitatDyn Dataset: Dynamic Object Detection to Kinematics Estimation [16.36110033895749]
We propose HabitatDyn, a dataset containing synthetic RGB videos, semantic labels, and depth information, as well as kinematics information.
HabitatDyn was created from the perspective of a mobile robot with a moving camera, and contains 30 scenes featuring six different types of moving objects with varying velocities.
arXiv Detail & Related papers (2023-04-21T09:57:35Z)
- RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Scene-Aware Ambidextrous Bin Picking via Physics-based Metaverse Synthesis [72.85526892440251]
We introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis.
The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order and ambidextrous grasp labels for a parallel-jaw and vacuum gripper.
We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 levels of difficulties and an unseen object set to evaluate different object and layout properties.
arXiv Detail & Related papers (2022-08-08T08:15:34Z)
- Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.