SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations
- URL: http://arxiv.org/abs/2507.05751v1
- Date: Tue, 08 Jul 2025 07:54:07 GMT
- Title: SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations
- Authors: Yegyu Han, Taegyoon Yoon, Dayeon Woo, Sojeong Kim, Hyung-Sin Kim
- Abstract summary: We introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For three common household objects (spray, pringles, and tincase), we acquire 101.9k RGB and 10k depth images, providing 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset show that applying sensor control at test time yields greater performance improvement than digital data augmentation.
- Score: 1.8350044465969415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain, or depth-sensor mode (and the potential of test-time sensor control to mitigate such variations) largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For three common household objects (spray, pringles, and tincase), we acquire 101.9k RGB and 10k depth images, which can provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset show that applying sensor control at test time yields greater performance improvement than digital data augmentation, achieving performance comparable to or better than costly increases in real-world training-data quantity and diversity. Adapting either RGB or depth sensors individually is effective, while jointly adapting multimodal RGB-D configurations yields even greater improvements. SenseShift6D extends the 6D-pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments. Our dataset is available at huggingface.co/datasets/Yegyu/SenseShift6D, and associated scripts can be found at github.com/yegyu-han/SenseShift6D.
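As a rough, hedged illustration of how the release could be used, the sketch below pulls the dataset from the Hugging Face Hub and emulates test-time sensor control by sweeping the physically captured exposure/gain/depth-mode variants of a scene and keeping the most confident pose estimate. The directory layout, file pairing, and the `estimate_pose` helper are illustrative assumptions, not the released API; see the associated GitHub scripts for the actual tooling.
```python
from pathlib import Path
import random

from huggingface_hub import snapshot_download

# Downloading the full dataset (~100k images) is large; a real script would
# restrict `allow_patterns` to a single object or scene.
root = Path(snapshot_download(repo_id="Yegyu/SenseShift6D", repo_type="dataset"))

def estimate_pose(rgb_path, depth_path):
    """Stand-in for any RGB-D 6D pose estimator; returns (pose, confidence)."""
    return None, random.random()  # dummy values so the sketch runs end to end

# Sweep the physically captured exposure/gain/depth-mode variants of one object
# (the directory layout below is assumed, not the documented structure).
best = None
for rgb_path in sorted(root.glob("spray/**/rgb/*.png")):
    depth_path = rgb_path.parent.parent / "depth" / rgb_path.name
    if not depth_path.exists():
        continue
    pose, conf = estimate_pose(rgb_path, depth_path)
    if best is None or conf > best[0]:
        best = (conf, pose, rgb_path)

if best is not None:
    print("Most confident sensor-lighting variant:", best[2])
```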
Related papers
- MSSIDD: A Benchmark for Multi-Sensor Denoising [55.41612200877861]
We introduce a new benchmark, the Multi-Sensor SIDD dataset, which is the first raw-domain dataset designed to evaluate the sensor transferability of denoising models.
We propose a sensor consistency training framework that enables denoising models to learn the sensor-invariant features.
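The summary above only names the idea, so the following is a loose, generic sketch of what a sensor-consistency objective can look like (illustrative PyTorch, not the MSSIDD code): two sensor renditions of the same scene are encoded, the usual denoising loss is applied to both, and a consistency term penalizes divergence of their features.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 32, 3, padding=1))   # toy raw-domain encoder
decoder = nn.Conv2d(32, 4, 3, padding=1)                    # toy denoising head

# The same clean scene, "re-captured" through two different sensor pipelines.
noisy_a = torch.randn(2, 4, 64, 64)   # rendition through sensor A (dummy)
noisy_b = torch.randn(2, 4, 64, 64)   # rendition through sensor B (dummy)
clean   = torch.randn(2, 4, 64, 64)   # shared clean target (dummy)

feat_a, feat_b = encoder(noisy_a), encoder(noisy_b)
loss = (F.l1_loss(decoder(feat_a), clean)
        + F.l1_loss(decoder(feat_b), clean)
        + 0.1 * F.mse_loss(feat_a, feat_b))   # sensor-invariance (consistency) term
loss.backward()
```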
arXiv Detail & Related papers (2024-11-18T13:32:59Z)
- Multimodal Object Detection using Depth and Image Data for Manufacturing Parts [1.0819408603463427]
This work proposes a multi-sensor system combining a red-green-blue (RGB) camera and a 3D point-cloud sensor. A novel multimodal object detection method is developed to process both RGB and depth data. The results show that the multimodal model significantly outperforms the depth-only and RGB-only baselines on established object detection metrics.
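As a minimal, hedged illustration of RGB-depth fusion for detection (the paper's actual architecture is not specified in this summary), separate RGB and depth branches can be fused by feature concatenation before a detection head:
```python
import torch
import torch.nn as nn

rgb_branch   = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
depth_branch = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
det_head     = nn.Conv2d(32, 5, 1)   # toy head: 4 box offsets + 1 objectness per cell

rgb   = torch.randn(1, 3, 128, 128)  # dummy image
depth = torch.randn(1, 1, 128, 128)  # dummy registered depth map
fused = torch.cat([rgb_branch(rgb), depth_branch(depth)], dim=1)
preds = det_head(fused)              # (1, 5, 64, 64) grid of raw predictions
```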
arXiv Detail & Related papers (2024-11-13T22:43:15Z)
- DIDLM: A SLAM Dataset for Difficult Scenarios Featuring Infrared, Depth Cameras, LIDAR, 4D Radar, and Others under Adverse Weather, Low Light Conditions, and Rough Roads [20.600516423425688]
We introduce a multi-sensor dataset covering challenging scenarios such as snowy weather, rainy weather, nighttime conditions, speed bumps, and rough terrains. The dataset includes rarely utilized sensors for extreme conditions, such as 4D millimeter-wave radar, infrared cameras, and depth cameras, alongside 3D LiDAR, RGB cameras, GPS, and IMU. It supports both autonomous driving and ground robot applications and provides reliable GPS/INS ground truth data, covering structured and semi-structured terrains.
arXiv Detail & Related papers (2024-04-15T09:49:33Z)
- Robust Depth Enhancement via Polarization Prompt Fusion Tuning [112.88371907047396]
We present a framework that leverages polarization imaging to improve inaccurate depth measurements from various depth sensors.
Our method first adopts a learning-based strategy where a neural network is trained to estimate a dense and complete depth map from polarization data and a sensor depth map from different sensors.
To further improve the performance, we propose a Polarization Prompt Fusion Tuning (PPFT) strategy to effectively utilize RGB-based models pre-trained on large-scale datasets.
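One generic reading of prompt fusion tuning is to freeze a model pre-trained on large RGB-based datasets and train only a small module that injects the extra modalities. The toy PyTorch sketch below illustrates that pattern; the module shapes and the way the prompt is injected are assumptions, not the paper's design.
```python
import torch
import torch.nn as nn

# Stand-in for a depth network pre-trained on large RGB-based datasets.
pretrained = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(32, 1, 3, padding=1))
for p in pretrained.parameters():
    p.requires_grad = False          # keep the pre-trained weights frozen

# Trainable "prompt": maps raw sensor depth + polarization cues into the
# frozen model's input space.
prompt = nn.Conv2d(1 + 3, 3, kernel_size=1)

rgb       = torch.randn(1, 3, 64, 64)
polar     = torch.randn(1, 3, 64, 64)   # polarization-derived channels (dummy)
raw_depth = torch.randn(1, 1, 64, 64)   # noisy depth from some sensor (dummy)

fused_input   = rgb + prompt(torch.cat([raw_depth, polar], dim=1))
refined_depth = pretrained(fused_input)  # only `prompt` receives gradients
```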
arXiv Detail & Related papers (2024-04-05T17:55:33Z)
- RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery [72.13154206106259]
We propose a novel pipeline that decouples the 6D pose and size estimation to mitigate the influence of imperfect scales on rigid transformations.
Specifically, we leverage a pre-trained monocular estimator to extract local geometric information.
A separate branch is designed to directly recover the metric scale of the object based on category-level statistics.
arXiv Detail & Related papers (2023-09-19T02:20:26Z)
- Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor [58.305341034419136]
We present the first dense SLAM system with a monocular camera and a light-weight ToF sensor.
We propose a multi-modal implicit scene representation that supports rendering both the signals from the RGB camera and light-weight ToF sensor.
Experiments demonstrate that our system well exploits the signals of light-weight ToF sensors and achieves competitive results.
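In spirit, such a multi-modal implicit representation is optimized by rendering both signals from one shared field and supervising each against its own sensor. The fragment below shows only that loss combination with dummy tensors, omitting the volume renderer entirely; the weighting is an arbitrary illustrative choice.
```python
import torch
import torch.nn.functional as F

# Stand-ins for quantities volume-rendered from one shared implicit field.
rendered_rgb = torch.rand(1024, 3, requires_grad=True)  # per-ray color
rendered_tof = torch.rand(64, requires_grad=True)       # per-zone ToF response
observed_rgb = torch.rand(1024, 3)                      # camera supervision (dummy)
observed_tof = torch.rand(64)                           # light-weight ToF supervision (dummy)

loss = F.mse_loss(rendered_rgb, observed_rgb) + 0.5 * F.mse_loss(rendered_tof, observed_tof)
loss.backward()
```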
arXiv Detail & Related papers (2023-08-28T07:56:13Z)
- FloatingFusion: Depth from ToF and Image-stabilized Stereo Cameras [37.812681878193914]
Smartphones now have multimodal camera systems with time-of-flight (ToF) depth sensors and multiple color cameras. However, producing accurate high-resolution depth is still challenging due to the low resolution and limited active illumination power of ToF sensors.
We propose an automatic calibration technique based on dense 2D/3D matching that can estimate camera parameters from a single snapshot.
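One standard way to realize calibration from 2D/3D matches is a RANSAC-based PnP solve; the snippet below sketches that with synthetic correspondences and assumed intrinsics, and is not claimed to be the paper's exact optimization.
```python
import numpy as np
import cv2

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                 # assumed color-camera intrinsics

# Synthetic 2D-3D correspondences standing in for dense ToF-to-image matches.
pts_3d = np.random.rand(100, 3) * 2.0
rvec_true = np.array([0.1, -0.2, 0.05])
tvec_true = np.array([0.02, 0.0, 0.5])
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_true, tvec_true, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
print("recovered rotation (Rodrigues):", rvec.ravel())
print("recovered translation:", tvec.ravel())
```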
arXiv Detail & Related papers (2022-10-06T09:57:09Z)
- DELTAR: Depth Estimation from a Light-weight ToF Sensor and RGB Image [39.389538555506256]
We propose DELTAR, a novel method to empower light-weight ToF sensors with the capability of measuring high resolution and accurate depth.
As the core of DELTAR, a feature extractor customized for depth distribution and an attention-based neural architecture are proposed to efficiently fuse information from the color and ToF domains.
Experiments show that our method produces more accurate depth than existing frameworks designed for depth completion and depth super-resolution and achieves on par performance with a commodity-level RGB-D sensor.
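As a hedged sketch of attention-based color/ToF fusion in the spirit of this summary, low-resolution ToF zone tokens can serve as keys and values for queries from the image feature map; the token counts, embedding size, and attention direction here are illustrative assumptions, not DELTAR's actual design.
```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

tof_tokens   = torch.randn(1, 8 * 8, 64)    # one token per low-resolution ToF zone
image_tokens = torch.randn(1, 32 * 32, 64)  # flattened color feature map

# Image features query the ToF tokens to pull in metric depth cues.
fused, _ = attn(query=image_tokens, key=tof_tokens, value=tof_tokens)
# `fused` (1, 1024, 64) would then feed a decoder predicting dense depth.
```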
arXiv Detail & Related papers (2022-09-27T13:11:37Z)
- Learning Online Multi-Sensor Depth Fusion [100.84519175539378]
SenFuNet is a depth fusion approach that learns sensor-specific noise and outlier statistics.
We conduct experiments with various sensor combinations on the real-world CoRBS and Scene3D datasets.
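A toy version of learned, sensor-aware depth fusion is per-pixel confidence weighting, where each sensor's weights are predicted by a small network; this is illustrative only and differs from SenFuNet's actual formulation in detail.
```python
import torch
import torch.nn as nn

# One tiny confidence predictor per sensor (learned weights would encode each
# sensor's noise and outlier behaviour).
conf_a = nn.Conv2d(1, 1, 3, padding=1)
conf_b = nn.Conv2d(1, 1, 3, padding=1)

depth_a = torch.rand(1, 1, 64, 64)   # dummy depth map from sensor A
depth_b = torch.rand(1, 1, 64, 64)   # dummy depth map from sensor B

weights = torch.softmax(torch.cat([conf_a(depth_a), conf_b(depth_b)], dim=1), dim=1)
fused   = weights[:, :1] * depth_a + weights[:, 1:] * depth_b  # per-pixel weighted fusion
```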
arXiv Detail & Related papers (2022-04-07T10:45:32Z)
- Multi-sensor large-scale dataset for multi-view 3D reconstruction [63.59401680137808]
We present a new multi-sensor dataset for multi-view 3D surface reconstruction.
It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner.
We provide around 1.4 million images of 107 different scenes acquired from 100 viewing directions under 14 lighting conditions.
arXiv Detail & Related papers (2022-03-11T17:32:27Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection, and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
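A minimal multi-task training sketch matching this summary is a shared encoder with three supervised heads; the architecture, losses, and weighting below are illustrative stand-ins rather than the MMFT network.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Conv2d(4, 32, 3, padding=1)          # shared RGB-D encoder (toy)
heads = nn.ModuleDict({
    "saliency": nn.Conv2d(32, 1, 1),
    "depth":    nn.Conv2d(32, 1, 1),
    "contour":  nn.Conv2d(32, 1, 1),
})

x  = torch.randn(2, 4, 64, 64)                           # RGB + depth input (dummy)
gt = {name: torch.rand(2, 1, 64, 64) for name in heads}  # dummy targets in [0, 1]

feat = backbone(x)
loss = sum(F.binary_cross_entropy_with_logits(head(feat), gt[name])
           for name, head in heads.items())
loss.backward()
```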
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
- RGB-D-E: Event Camera Calibration for Fast 6-DOF Object Tracking [16.06615504110132]
We propose to use an event-based camera to increase the speed of 3D object tracking in 6 degrees of freedom.
This application requires handling very high object speed to convey compelling AR experiences.
We develop a deep learning approach that combines an existing RGB-D network with a novel event-based network in a cascade fashion.
arXiv Detail & Related papers (2020-06-09T01:55:48Z)