VPOcc: Exploiting Vanishing Point for 3D Semantic Occupancy Prediction
- URL: http://arxiv.org/abs/2408.03551v2
- Date: Thu, 14 Aug 2025 04:31:02 GMT
- Title: VPOcc: Exploiting Vanishing Point for 3D Semantic Occupancy Prediction
- Authors: Junsu Kim, Junhee Lee, Ukcheol Shin, Jean Oh, Kyungdon Joo
- Abstract summary: Understanding 3D scenes semantically and spatially is crucial for the safe navigation of robots and autonomous vehicles. Camera-based 3D semantic occupancy prediction infers complete voxel grids from 2D images. This task inherently suffers from a 2D-3D discrepancy, where objects of the same size in 3D space appear at different scales in a 2D image depending on their distance from the camera. We propose a novel framework called VPOcc that leverages a vanishing point (VP) to mitigate the 2D-3D discrepancy at both the pixel and feature levels.
- Score: 24.947072696837118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding 3D scenes semantically and spatially is crucial for the safe navigation of robots and autonomous vehicles, aiding obstacle avoidance and accurate trajectory planning. Camera-based 3D semantic occupancy prediction, which infers complete voxel grids from 2D images, is gaining importance in robot vision for its resource efficiency compared to 3D sensors. However, this task inherently suffers from a 2D-3D discrepancy, where objects of the same size in 3D space appear at different scales in a 2D image depending on their distance from the camera due to perspective projection. To tackle this issue, we propose a novel framework called VPOcc that leverages a vanishing point (VP) to mitigate the 2D-3D discrepancy at both the pixel and feature levels. As a pixel-level solution, we introduce a VPZoomer module, which warps images by counteracting the perspective effect using a VP-based homography transformation. In addition, as a feature-level solution, we propose a VP-guided cross-attention (VPCA) module that performs perspective-aware feature aggregation, utilizing 2D image features that are more suitable for 3D space. Lastly, we integrate two feature volumes extracted from the original and warped images to compensate for each other through a spatial volume fusion (SVF) module. By effectively incorporating VP into the network, our framework achieves improvements in both IoU and mIoU metrics on SemanticKITTI and SSCBench-KITTI360 datasets. Additional details are available at https://vision3d-lab.github.io/vpocc/.
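The 2D-3D discrepancy described in the abstract follows from the standard pinhole camera model, in which projected size is proportional to focal length times object size divided by depth. The short Python sketch below only illustrates that relationship; it is not part of VPOcc, and the function name `projected_height_px`, the focal length, and the object height are all hypothetical values chosen for illustration.

```python
def projected_height_px(object_height_m: float, depth_m: float, focal_length_px: float) -> float:
    """Image-plane height (in pixels) of an object under an ideal pinhole camera.

    Assumes the object is fronto-parallel and centered, so the projected
    height is simply focal_length * height / depth.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive (object in front of the camera)")
    return focal_length_px * object_height_m / depth_m


if __name__ == "__main__":
    focal_length_px = 700.0  # hypothetical focal length, in pixels
    object_height_m = 1.5    # hypothetical object height, in meters
    for depth_m in (5.0, 10.0, 20.0, 40.0):
        height = projected_height_px(object_height_m, depth_m, focal_length_px)
        print(f"depth {depth_m:5.1f} m -> {height:6.1f} px")
```

Each doubling of depth halves the projected height, so the same object covers far fewer pixels when it is distant. This is the scale imbalance that the abstract says VPOcc aims to mitigate.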
Related papers
- HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving [52.959716866316604]
Camera-based 3D semantic scene completion (SSC) plays a crucial role in autonomous driving. Existing SSC methods suffer from the inherent input-output dimension gap and annotation-reality density gap. We propose a corresponding High-Dimension High-Density Semantic Scene Completion framework with expanded pixel semantics and refined voxel occupancies.
arXiv Detail & Related papers (2025-11-11T07:24:35Z) - Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding [0.8883733362171032]
We propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness.
arXiv Detail & Related papers (2025-10-19T22:40:18Z) - Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting [64.64738535860351]
We present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence.
arXiv Detail & Related papers (2025-07-24T14:53:26Z) - 3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation [27.51272922798475]
We introduce a novel cross-task 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. We evaluate our method on MPIIGaze, EVE, and GazeCapture datasets, collected respectively on laptops, desktop computers, and mobile devices.
arXiv Detail & Related papers (2025-02-06T13:37:09Z) - ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction [11.312780421161204]
In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models for fine-grained 3D occupancy prediction.
We also propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling.
Our experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks.
arXiv Detail & Related papers (2024-12-15T15:04:27Z) - BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence [11.91274849875519]
We introduce a novel image-centric 3D perception model, BIP3D, to overcome the limitations of point-centric methods. We leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
arXiv Detail & Related papers (2024-11-22T11:35:42Z) - Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation [32.50849425431012]
For autonomous cars equipped with multi-camera and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions.
Recent methods are mainly built on the 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space.
In this work, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence.
arXiv Detail & Related papers (2024-11-19T02:40:42Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
Integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - Playing to Vision Foundation Model's Strengths in Stereo Matching [13.887661472501618]
This study serves as the first exploration of a viable approach for adapting vision foundation models (VFMs) to stereo matching.
Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention.
ViTAStereo outperforms the second-best network StereoBase by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels.
arXiv Detail & Related papers (2024-04-09T12:34:28Z) - Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction [10.698054425507475]
This letter presents a novel multi-modal, i.e., LiDAR-camera 3D semantic occupancy prediction framework, dubbed Co-Occ.
Volume rendering in the feature space can proficiently bridge the gap between 3D LiDAR sweeps and 2D images.
arXiv Detail & Related papers (2024-04-06T09:01:19Z) - NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z) - EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization [44.05930316729542]
We propose EP2P-Loc, a novel large-scale visual localization method for 3D point clouds.
To increase the number of inliers, we propose a simple algorithm to remove invisible 3D points in the image.
For the first time in this task, we employ a differentiable PnP for end-to-end training.
arXiv Detail & Related papers (2023-09-14T07:06:36Z) - Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View [44.78243406441798]
This paper focuses on leveraging geometry information, such as depth, to model such feature transformation.
We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view.
We then aggregate the 3D feature volume based on the 3D space occupancy derived from depth to the BEV frame.
arXiv Detail & Related papers (2023-07-09T06:07:22Z) - A Simple Baseline for Supervised Surround-view Depth Estimation [25.81521612343612]
We propose S3Depth, a Simple Baseline for Supervised Surround-view Depth Estimation.
We employ a global-to-local feature extraction module which combines CNN with transformer layers for enriched representations.
Our method achieves superior performance over existing state-of-the-art methods on both DDAD and nuScenes datasets.
arXiv Detail & Related papers (2023-03-14T10:06:19Z) - OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z) - VPIT: Real-time Embedded Single Object 3D Tracking Using Voxel Pseudo Images [90.60881721134656]
We propose a novel voxel-based 3D single object tracking (3D SOT) method called Voxel Pseudo Image Tracking (VPIT).
Experiments on KITTI Tracking dataset show that VPIT is the fastest 3D SOT method and maintains competitive Success and Precision values.
arXiv Detail & Related papers (2022-06-06T14:02:06Z) - Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data [80.14669385741202]
We propose a self-supervised pre-training method for 3D perception models tailored to autonomous driving data.
We leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups.
Our method does not require any point cloud nor image annotations.
arXiv Detail & Related papers (2022-03-30T12:40:30Z) - VPFusion: Joint 3D Volume and Pixel-Aligned Feature Fusion for Single and Multi-view 3D Reconstruction [23.21446438011893]
VPFusion attains high-quality reconstruction using both - 3D feature volume to capture 3D-structure-aware context.
Existing approaches use RNN, feature pooling, or attention computed independently in each view for multi-view fusion.
We show improved multi-view feature fusion by establishing transformer-based pairwise view association.
arXiv Detail & Related papers (2022-03-14T23:30:58Z) - FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to make a general adapted 2D detector work in this 3D task.
In this technical report, we study this problem with a practice built on a fully convolutional single-stage detector.
Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z) - M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention.
The proposed M3DSSD achieves significantly better performance than the monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z) - Object-Centric Multi-View Aggregation [86.94544275235454]
We present an approach for aggregating a sparse set of views of an object in order to compute a semi-implicit 3D representation in the form of a volumetric feature grid.
Key to our approach is an object-centric canonical 3D coordinate system into which views can be lifted, without explicit camera pose estimation.
We show that computing a symmetry-aware mapping from pixels to the canonical coordinate system allows us to better propagate information to unseen regions.
arXiv Detail & Related papers (2020-07-20T17:38:31Z) - Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation [76.21696417873311]
We introduce a learnable module, cylindrical convolutional networks (CCNs), that exploit cylindrical representation of a convolutional kernel defined in the 3D space.
CCNs extract a view-specific feature through a view-specific convolutional kernel to predict object category scores at each viewpoint.
Our experiments demonstrate the effectiveness of the cylindrical convolutional networks on joint object detection and viewpoint estimation.
arXiv Detail & Related papers (2020-03-25T10:24:58Z) - ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection [69.68263074432224]
We present a novel framework named ZoomNet for stereo imagery-based 3D detection.
The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes.
To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straight-forward module -- adaptive zooming.
arXiv Detail & Related papers (2020-03-01T17:18:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.