Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction
- URL: http://arxiv.org/abs/2602.06488v1
- Date: Fri, 06 Feb 2026 08:30:26 GMT
- Title: Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction
- Authors: Zizhan Guo, Yi Feng, Mengtan Zhang, Haoran Zhang, Wei Ye, Rui Fan
- Abstract summary: Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation. This paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction.
- Score: 18.187675837847667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
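The abstract refers to the variables of the volume rendering process without naming them. As a minimal sketch for orientation only, the Python below computes the quantities that standard NeRF-style volume rendering defines along a single ray: per-sample opacity, accumulated transmittance, and rendering weight. The function name `ray_quantities` and the sample values are hypothetical, and the sketch does not indicate which quantity the paper adopts as the occupancy probability.

```python
import math
from typing import List, Tuple


def ray_quantities(
    densities: List[float], deltas: List[float]
) -> Tuple[List[float], List[float], List[float]]:
    """Standard volume rendering quantities along one ray (illustrative only).

    densities: non-negative volume density sigma_i at each sample.
    deltas:    distance delta_i covered by each sample.
    Returns (alphas, transmittances, weights), one value per sample.
    """
    if len(densities) != len(deltas):
        raise ValueError("densities and deltas must have the same length")
    if any(sigma < 0.0 for sigma in densities):
        raise ValueError("densities must be non-negative")

    alphas, transmittances, weights = [], [], []
    transmittance = 1.0  # fraction of light that reaches the first sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        alphas.append(alpha)
        transmittances.append(transmittance)
        weights.append(transmittance * alpha)  # contribution to the rendered pixel
        transmittance *= 1.0 - alpha  # light remaining for the samples behind
    return alphas, transmittances, weights


if __name__ == "__main__":
    # Hypothetical ray: two empty samples followed by two equally dense ones.
    alphas, transmittances, weights = ray_quantities(
        [0.0, 0.0, 5.0, 5.0], [1.0, 1.0, 1.0, 1.0]
    )
    for i, (a, t, w) in enumerate(zip(alphas, transmittances, weights)):
        print(f"sample {i}: alpha={a:.4f} transmittance={t:.4f} weight={w:.4f}")
```

In this hypothetical ray the two dense samples have the same opacity, yet the second one receives a near-zero rendering weight because the first blocks almost all of the light. The example shows only that these quantities can disagree behind an occluder, which is why the choice of variable treated as occupancy matters.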
Related papers
- VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion [6.144392125326462]
Camera-based 3D Semantic Scene Completion is a critical task for autonomous driving and robotic scene understanding. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. We propose a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion.
arXiv Detail & Related papers (2025-12-22T02:05:45Z)
- SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation [21.433019604658366]
We propose a novel probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. We achieve state-of-the-art results on Real275, YCB-V, and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks.
arXiv Detail & Related papers (2025-11-03T12:11:35Z)
- Enhancing Dual Network Based Semi-Supervised Medical Image Segmentation with Uncertainty-Guided Pseudo-Labeling [5.1962665598872135]
This paper proposes a novel semi-supervised 3D medical image segmentation framework based on a dual-network architecture. Specifically, we investigate a Cross Consistency Enhancement module using both cross pseudo and entropy-filtered supervision to reduce the noisy pseudo-labels. In addition, we use a self-supervised contrastive learning mechanism to align uncertain voxel features with reliable class prototypes.
arXiv Detail & Related papers (2025-09-16T13:40:20Z)
- Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts [80.32933059529135]
Test-Time Adaptation (TTA) methods have emerged to adapt to target distributions during inference. We propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues.
arXiv Detail & Related papers (2025-08-28T07:09:21Z)
- Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning [64.32618490065117]
A core problem of Embodied AI is to learn object manipulation from observation, as humans do. We propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
arXiv Detail & Related papers (2025-08-02T04:14:18Z)
- Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object [55.93553895520324]
We propose a novel training-free approach that integrates local dense observations and multi-source priors for reconstruction. Our method introduces a fusion-based strategy to effectively align these priors in DDIM sampling, thereby generating multi-view consistent images to supervise invisible views.
arXiv Detail & Related papers (2025-05-29T03:51:37Z)
- ORA3D: Overlap Region Aware Multi-view 3D Object Detection [11.58746596768273]
Current multi-view 3D object detection methods often fail to detect objects in the overlap region properly.
We propose using the following two main modules: (1) Stereo Disparity Estimation for Weak Depth Supervision and (2) Adversarial Overlap Region Discriminator.
Our proposed method outperforms current state-of-the-art models, i.e., DETR3D and BEVDet.
arXiv Detail & Related papers (2022-07-02T15:28:44Z)
- On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi and (or) weakly supervised learning.
We propose to impose multi-view geometrical constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available.
arXiv Detail & Related papers (2022-03-29T19:11:54Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation [58.72192168935338]
Generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable.
We propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions.
Our proposed model employs three consecutive differentiable transformations named as forward-kinematics, camera-projection and spatial-map transformation.
arXiv Detail & Related papers (2020-06-24T23:56:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.