XYZ-IBD: A High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity
- URL: http://arxiv.org/abs/2506.00599v2
- Date: Mon, 16 Jun 2025 15:48:51 GMT
- Title: XYZ-IBD: A High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity
- Authors: Junwen Huang, Jizhong Liang, Jiaqi Hu, Martin Sundermeyer, Peter KT Yu, Nassir Navab, Benjamin Busam
- Abstract summary: XYZ-IBD is a bin-picking dataset for 6D pose estimation. It reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce XYZ-IBD, a bin-picking dataset for 6D pose estimation that captures real-world industrial complexity, including challenging object geometries, reflective materials, severe occlusions, and dense clutter. The dataset reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. Unlike existing datasets, which primarily focus on household objects and are approaching saturation, XYZ-IBD targets unsolved, realistic industrial conditions. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes. These objects are heavily occluded and randomly arranged in bins at high density, replicating the challenges of real-world bin-picking. XYZ-IBD was collected using two high-precision industrial cameras and one commercially available camera, providing RGB, grayscale, and depth images. It contains 75 multi-view real-world scenes, along with a large-scale synthetic dataset rendered under simulated bin-picking conditions. We employ a meticulous annotation pipeline that includes anti-reflection spray, multi-view depth fusion, and semi-automatic annotation, achieving the millimeter-level pose-labeling accuracy required for industrial manipulation. Quantification in simulated environments confirms the reliability of the ground-truth annotations. We benchmark state-of-the-art methods on 2D detection, 6D pose estimation, and depth estimation on our dataset, revealing significant performance degradation in our setups compared to current academic household benchmarks. By capturing the complexity of real-world bin-picking scenarios, XYZ-IBD introduces more realistic and challenging problems for future research. The dataset and benchmark are publicly available at https://xyz-ibd.github.io/XYZ-IBD/.
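The abstract's key annotation ingredient is multi-view depth fusion: depth maps captured from several calibrated viewpoints are back-projected into a common world frame and merged, so dropouts and noise on reflective metal surfaces in one view are compensated by the others. The sketch below illustrates the general idea with a simple voxel-averaged point-cloud merge; the intrinsics, poses, and 1 mm voxel size are our own assumptions, since the paper's exact pipeline is not given in the abstract.

```python
# Minimal sketch of multi-view depth fusion (an assumed formulation,
# not the authors' published pipeline). Each calibrated view's depth
# map is lifted to world-frame 3D points; points landing in the same
# voxel are averaged, damping per-view sensor noise and dropout.
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift a depth map (H, W), in meters, to world-frame points (N, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                       # zero depth marks dropout
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]  # pinhole back-projection
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z, np.ones_like(z)])  # (4, N) homogeneous
    return (cam_to_world @ pts)[:3].T           # (N, 3)

def fuse_views(depths, Ks, poses, voxel=0.001):
    """Merge clouds from all views; average points per 1 mm voxel."""
    cloud = np.concatenate([backproject(d, K, T)
                            for d, K, T in zip(depths, Ks, poses)])
    keys = np.floor(cloud / voxel).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    counts = np.bincount(inv).astype(float)
    fused = np.stack([np.bincount(inv, weights=cloud[:, d]) / counts
                      for d in range(3)], axis=1)
    return fused                                # (V, 3) fused points
```

A production pipeline would add outlier rejection and TSDF-style weighted fusion, but even this plain average shows why multiple views matter: a pixel lost to specular reflection in one camera is usually observed cleanly from another.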
Related papers
- Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting [64.64738535860351]
We present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence.
arXiv Detail & Related papers (2025-07-24T14:53:26Z)
- GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes [5.289647064481469]
We present GraspClutter6D, a large-scale real-world grasping dataset featuring 1,000 cluttered scenes with dense arrangements. We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments. We validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments.
arXiv Detail & Related papers (2025-04-09T13:15:46Z)
- OmniEraser: Remove Objects and Their Effects in Images with Paired Video-Frame Data [21.469971783624402]
In this paper, we propose Video4Removal, a large-scale dataset comprising over 100,000 high-quality samples with realistic object shadows and reflections. By constructing object-background pairs from video frames with off-the-shelf vision models, the labor costs of data acquisition can be significantly reduced. To avoid generating shape-like artifacts and unintended content, we propose Object-Background Guidance. We present OmniEraser, a novel method that seamlessly removes objects and their visual effects using only object masks as input.
arXiv Detail & Related papers (2025-01-13T15:12:40Z)
- Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes [7.948212109423146]
We introduce Boundless, a photo-realistic synthetic data generation system for object detection in dense urban streetscapes.
Boundless can replace massive real-world data collection and manual ground-truth object annotation (labeling).
We evaluate the performance of object detection models trained on the dataset generated by Boundless.
arXiv Detail & Related papers (2024-09-04T18:28:10Z)
- Multi-Modal Dataset Acquisition for Photometrically Challenging Object [56.30027922063559]
This paper addresses the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects.
We propose a novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets.
arXiv Detail & Related papers (2023-08-21T10:38:32Z)
- The Drunkard's Odometry: Estimating Camera Motion in Deforming Scenes [79.00228778543553]
This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes.
Simulations in realistic 3D buildings let us obtain a vast amount of data and ground-truth labels.
We present a novel deformable odometry method, dubbed the Drunkard's Odometry, which decomposes optical flow estimates into rigid-body camera motion (a classical version of this decomposition is sketched after this entry).
arXiv Detail & Related papers (2023-06-29T13:09:31Z)
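To ground the phrase "decomposes optical flow estimates into rigid-body camera motion", the sketch below shows the classical small-motion formulation, in which per-pixel flow in normalized coordinates is explained by a 6-DoF camera twist through the point interaction matrix. This is a textbook baseline for the rigid part of the problem, not the deformable odometry method the paper proposes.

```python
# Classical least-squares decomposition of optical flow into a
# rigid-body camera twist (a textbook illustration, NOT the Drunkard's
# Odometry method itself). Coordinates and flow are in normalized
# image units; depth is metric.
import numpy as np

def twist_from_flow(xy, flow, depth):
    """xy, flow: (N, 2) normalized coords and flow; depth: (N,).
    Returns (v, w): linear and angular camera velocity."""
    x, y, Z = xy[:, 0], xy[:, 1], depth
    zeros = np.zeros_like(x)
    # Rows of the point interaction matrix (Chaumette's visual-servoing form).
    Lx = np.stack([-1 / Z, zeros, x / Z, x * y, -(1 + x**2), y], axis=1)
    Ly = np.stack([zeros, -1 / Z, y / Z, 1 + y**2, -x * y, -x], axis=1)
    A = np.concatenate([Lx, Ly])                  # (2N, 6) stacked rows
    b = np.concatenate([flow[:, 0], flow[:, 1]])  # (2N,) observed flow
    twist, *_ = np.linalg.lstsq(A, b, rcond=None)
    return twist[:3], twist[3:]                   # v (m/s), w (rad/s)
```

In a deforming scene this rigid model is violated, which is exactly the regime the Drunkard's Odometry targets; the residual of such a fit indicates how much of the observed flow rigid camera motion alone cannot explain.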
- MetaGraspNet: A Large-Scale Benchmark Dataset for Scene-Aware Ambidextrous Bin Picking via Physics-based Metaverse Synthesis [72.85526892440251]
We introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis.
The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order and ambidextrous grasp labels for a parallel-jaw and vacuum gripper.
We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 difficulty levels plus an unseen-object set for evaluating different object and layout properties.
arXiv Detail & Related papers (2022-08-08T08:15:34Z)
- A Multi-purpose Real Haze Benchmark with Quantifiable Haze Levels and Ground Truth [61.90504318229845]
This paper introduces the first paired real image benchmark dataset with hazy and haze-free images, and in-situ haze density measurements.
This dataset was produced in a controlled environment with professional smoke-generating machines that covered the entire scene.
A subset of this dataset has been used for the Object Detection in Haze Track of CVPR UG2 2022 challenge.
arXiv Detail & Related papers (2022-06-13T19:14:06Z)
- PhoCaL: A Multi-Modal Dataset for Category-Level Object Pose Estimation with Photometrically Challenging Objects [45.31344700263873]
We introduce PhoCaL, a multimodal dataset for category-level object pose estimation with photometrically challenging objects.
PhoCaL comprises 60 high-quality 3D models of household objects across 8 categories, including highly reflective, transparent, and symmetric objects.
Its capture protocol ensures sub-millimeter pose accuracy for opaque textured, shiny, and transparent objects, with no motion blur and perfect camera synchronisation.
arXiv Detail & Related papers (2022-05-18T09:21:09Z)
- 6-DoF Pose Estimation of Household Objects for Robotic Manipulation: An Accessible Dataset and Benchmark [17.493403705281008]
We present a new dataset for 6-DoF pose estimation of known objects, with a focus on robotic manipulation research.
We provide 3D scanned textured models of toy grocery objects, as well as RGBD images of the objects in challenging, cluttered scenes.
Using semi-automated RGBD-to-model texture correspondences, the images are annotated with ground truth poses that were verified empirically to be accurate to within a few millimeters.
We also propose a new pose evaluation metric called ADD-H, based upon the Hungarian assignment algorithm, that is robust to symmetries in object geometry without requiring their explicit enumeration (a sketch of the idea follows this entry).
arXiv Detail & Related papers (2022-03-11T01:19:04Z)
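As we read the abstract, ADD-H replaces the fixed point correspondence of the standard ADD metric with a minimum-cost bipartite (Hungarian) matching between the model points transformed by the estimated and ground-truth poses, so a symmetry-flipped but visually identical pose is not penalized. A minimal sketch under that reading (the subsampling size is our choice; consult the paper for the exact definition):

```python
# Hungarian-matched pose error in the spirit of ADD-H (our reading of
# the abstract, not a verified reimplementation). Symmetric objects
# score well under any pose that maps the model onto itself, without
# enumerating the symmetries.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def add_h(model_pts, R_gt, t_gt, R_est, t_est, n_sub=256):
    """model_pts: (N, 3) object-frame points; R: (3, 3), t: (3,).
    Returns the mean matched point distance (model units)."""
    if len(model_pts) > n_sub:                 # subsample: matching is O(n^3)
        idx = np.random.choice(len(model_pts), n_sub, replace=False)
        model_pts = model_pts[idx]
    pts_gt = model_pts @ R_gt.T + t_gt         # ground-truth placement
    pts_est = model_pts @ R_est.T + t_est      # estimated placement
    cost = cdist(pts_gt, pts_est)              # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)   # Hungarian matching
    return cost[rows, cols].mean()
```

For a rotationally symmetric part, a pose rotated by the symmetry angle produces the same point set up to permutation, so the matched cost stays near zero where plain ADD would report a large error.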
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)