Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
- URL: http://arxiv.org/abs/2509.08388v1
- Date: Wed, 10 Sep 2025 08:29:22 GMT
- Title: Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
- Authors: Dubing Chen, Huan Zheng, Yucheng Zhou, Xianfei Li, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen
- Abstract summary: Vision-based 3D semantic occupancy prediction is a critical task in 3D vision. Existing methods, however, often rely on modular pipelines. We propose a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline.
- Score: 63.752869043357585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, demonstrating significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.
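To make the lifting and propagation ideas named above more concrete, the sketch below is a minimal, illustrative PyTorch toy of a channel-grouped 2D-to-3D lift and a count-normalized voxel aggregation. It is not the authors' implementation: the class and function names (ChannelGroupedLifting, normalized_scatter), the tensor shapes, and the random voxel assignment are all assumptions, and the paper's causal loss and learnable camera offsets are not modeled here.

```python
import torch
import torch.nn as nn


class ChannelGroupedLifting(nn.Module):
    """Lift 2D image features into a camera frustum, predicting a separate depth
    distribution per channel group so different semantic groups can be placed at
    different depths (hypothetical simplification of the idea named in the abstract)."""

    def __init__(self, in_channels: int, num_groups: int, num_depth_bins: int):
        super().__init__()
        assert in_channels % num_groups == 0
        self.num_groups = num_groups
        self.num_depth_bins = num_depth_bins
        # One depth distribution per channel group instead of one shared per pixel.
        self.depth_head = nn.Conv2d(in_channels, num_groups * num_depth_bins, kernel_size=1)

    def forward(self, feats_2d: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats_2d.shape                        # (B, C, H, W) image features
        g, d = self.num_groups, self.num_depth_bins
        depth_prob = self.depth_head(feats_2d).view(b, g, d, h, w).softmax(dim=2)
        grouped = feats_2d.view(b, g, c // g, h, w)
        # Outer product over depth bins -> (B, G, C//G, D, H, W) frustum features.
        frustum = grouped.unsqueeze(3) * depth_prob.unsqueeze(2)
        return frustum.reshape(b, c, d, h, w)


def normalized_scatter(point_feats: torch.Tensor, voxel_idx: torch.Tensor, num_voxels: int) -> torch.Tensor:
    """Scatter per-point features into voxels and divide by the number of contributing
    points per voxel, so sparsely hit voxels are not under-weighted (a normalized-
    convolution-style aggregation; again only an illustration)."""
    acc = point_feats.new_zeros(num_voxels, point_feats.shape[1])
    cnt = point_feats.new_zeros(num_voxels, 1)
    acc.index_add_(0, voxel_idx, point_feats)
    cnt.index_add_(0, voxel_idx, torch.ones_like(point_feats[:, :1]))
    return acc / cnt.clamp(min=1.0)


if __name__ == "__main__":
    lifter = ChannelGroupedLifting(in_channels=64, num_groups=8, num_depth_bins=16)
    feats = torch.randn(2, 64, 24, 40)                          # toy image features
    frustum = lifter(feats)                                      # (2, 64, 16, 24, 40)
    pts = frustum[0].reshape(64, -1).t()                         # treat frustum cells as points
    vox_idx = torch.randint(0, 1000, (pts.shape[0],))            # stand-in for geometric voxel assignment
    voxels = normalized_scatter(pts, vox_idx, num_voxels=1000)   # (1000, 64)
    print(frustum.shape, voxels.shape)
```

The only point of the grouping in this toy is that each channel group gets its own depth distribution rather than one shared per pixel, and the count normalization keeps sparsely hit voxels from being under-weighted; how the paper actually couples these pieces through its causal loss is described in the abstract above.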
Related papers
- AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models [42.57469056850227]
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control. We propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations.
arXiv Detail & Related papers (2026-02-11T09:57:32Z)
- Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment [15.822150318879052]
We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment. We train a lightweight feature adapter using a reprojection-based consistency loss; a generic sketch of such a loss appears after this list. This enables state-of-the-art performance in both NVS and camera pose estimation.
arXiv Detail & Related papers (2025-12-09T18:59:52Z)
- Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding [0.8883733362171032]
We propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness.
arXiv Detail & Related papers (2025-10-19T22:40:18Z)
- ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions [91.55655961014027]
3D semantic occupancy and flow prediction are fundamental to scene understanding. This paper proposes a vision-based framework with three targeted improvements. Our purely convolutional architecture establishes new SOTA performance on multiple benchmarks for both semantic occupancy and joint semantic-flow prediction.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset while requiring the lowest image resolution and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
- DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework to address these challenges with a single model.
Building upon our 3D-aware Diffusion model with TransFormer, we propose a stronger version for 3D generation, i.e., DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Vox-E: Text-guided Voxel Editing of 3D Objects [14.88446525549421]
Large scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images.
We present a technique that harnesses the power of latent diffusion models for editing existing 3D objects.
arXiv Detail & Related papers (2023-03-21T17:36:36Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed Homography Loss, is proposed to achieve the goal; it exploits both 2D and 3D information.
Our method outperforms other state-of-the-art methods by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
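As referenced in the Selfi entry above, a reprojection-based consistency loss penalizes disagreement between features observed in one view and features re-sampled from another view at the reprojected locations. The sketch below is a generic construction of such a loss, not taken from that paper; the function name, argument conventions, and the pinhole-camera and rigid-transform assumptions are all hypothetical.

```python
import torch
import torch.nn.functional as F


def reprojection_consistency_loss(points_3d, feats_ref, feats_src, K, T_src_from_ref):
    """Generic reprojection consistency: project reference-frame 3D points into the
    source view and compare source features sampled there with reference features.

    points_3d:      (N, 3) 3D points expressed in the reference camera frame
    feats_ref:      (N, C) features associated with those points in the reference view
    feats_src:      (C, H, W) feature map of the source view
    K:              (3, 3) pinhole intrinsics of the source camera
    T_src_from_ref: (4, 4) rigid transform mapping reference-frame points to the source frame
    """
    c, h, w = feats_src.shape
    # Transform points into the source camera frame.
    pts_h = torch.cat([points_3d, torch.ones_like(points_3d[:, :1])], dim=1)   # (N, 4)
    pts_src = (T_src_from_ref @ pts_h.t()).t()[:, :3]                          # (N, 3)
    # Pinhole projection to pixel coordinates.
    uvz = (K @ pts_src.t()).t()                                                # (N, 3)
    uv = uvz[:, :2] / uvz[:, 2:3].clamp(min=1e-6)
    # Normalize to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,
                        uv[:, 1] / (h - 1) * 2 - 1], dim=1).view(1, 1, -1, 2)
    sampled = F.grid_sample(feats_src[None], grid, align_corners=True)         # (1, C, 1, N)
    sampled = sampled[0, :, 0].t()                                             # (N, C)
    # Only supervise points that land inside the source image and in front of the camera.
    valid = (grid.abs() <= 1).all(dim=-1).view(-1) & (pts_src[:, 2] > 0)
    if valid.sum() == 0:
        return feats_ref.new_zeros(())
    return F.smooth_l1_loss(sampled[valid], feats_ref[valid])
```

In a self-improvement loop, a loss of this form would be applied between overlapping views to train a feature adapter; the exact supervision used by Selfi may differ from this toy version.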
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.