Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
- URL: http://arxiv.org/abs/2602.21552v1
- Date: Wed, 25 Feb 2026 04:16:54 GMT
- Title: Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
- Authors: Changqing Zhou, Yueru Luo, Changhao Chen
- Abstract summary: We present GPOcc, a framework that leverages visual geometry priors for monocular occupancy prediction. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains.
- Score: 10.394184895110007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65$\times$ faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at https://github.com/JuIvyy/GPOcc.
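The key sampling step described in the abstract (extending surface points inward along camera rays to obtain volumetric samples) can be sketched as follows. This is a hypothetical illustration, not the authors' released code: the function name `volumetric_samples`, the fixed offset schedule, and the camera setup are all assumptions made for the example.

```python
# Hypothetical sketch of the ray-inward sampling idea from the abstract:
# back-project a depth map to surface points, then push each point
# further along its camera ray to generate volumetric samples.
import numpy as np

def volumetric_samples(depth, K, offsets=(0.0, 0.1, 0.2, 0.4)):
    """Back-project a depth map and sample points behind each surface.

    depth:   (H, W) metric depth along the optical axis
    K:       (3, 3) camera intrinsics
    offsets: depth increments (metres) added along each ray; 0 = surface
    Returns an (H*W*len(offsets), 3) array of camera-frame 3D points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T  # ray directions, z-component = 1
    pts = []
    for dz in offsets:               # 0 keeps the surface; dz > 0 goes inward
        pts.append(rays * (depth.reshape(-1, 1) + dz))
    return np.concatenate(pts, axis=0)

# Toy usage: a flat wall 2 m away seen by a 64x48 pinhole camera.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0,   0.0,  1.0]])
samples = volumetric_samples(np.full((48, 64), 2.0), K)
print(samples.shape)  # (48 * 64 * 4, 3) = (12288, 3)
```

In the paper these samples are further represented as Gaussian primitives for probabilistic occupancy inference; the sketch covers only the point-generation step.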
Related papers
- Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation [53.09168514034483]
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. We propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap.
arXiv Detail & Related papers (2026-02-27T08:54:20Z) - GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation [26.632472450402947]
Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines.
arXiv Detail & Related papers (2025-12-18T17:51:42Z) - PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors [13.825701925456768]
PlanarGS is a 3DGS-based framework tailored for indoor scene reconstruction. PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-10-27T23:32:19Z) - DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction [17.38916914453357]
Predicting the 3D occupancy of large-scale outdoor scenes from 2D images is ill-posed and resource-intensive. We present DGOcc, a depth-aware Global query-based network for monocular 3D Occupancy prediction. The proposed method achieves the best performance on monocular semantic occupancy prediction while reducing GPU and time overhead.
arXiv Detail & Related papers (2025-04-10T07:44:55Z) - GaussRender: Learning 3D Occupancy with Gaussian Rendering [86.89653628311565]
GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure.
arXiv Detail & Related papers (2025-02-07T16:07:51Z) - GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction [67.81475355852997]
3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings. We propose a world-model-based framework to exploit the scene evolution for perception. Our framework improves the performance of the single-frame counterpart by over 2% in mIoU without introducing additional computations.
arXiv Detail & Related papers (2024-12-13T18:59:54Z) - EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding [72.96388875744704]
3D occupancy prediction provides a comprehensive description of the surrounding scenes. Most existing methods focus on offline perception from one or a few views. We formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it.
arXiv Detail & Related papers (2024-12-05T17:57:09Z) - AGS-Mesh: Adaptive Gaussian Splatting and Meshing with Geometric Priors for Indoor Room Reconstruction Using Smartphones [19.429461194706786]
We propose an approach for joint surface depth and normal refinement of Gaussian Splatting methods for accurate 3D reconstruction of indoor scenes. Our filtering strategy and optimization design demonstrate significant improvements in both mesh estimation and novel-view synthesis.
arXiv Detail & Related papers (2024-11-28T17:04:32Z) - Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement [8.592248643229675]
Occupancy prediction plays a pivotal role in autonomous driving (AD).
Existing methods often incur high computational costs, which contradicts the real-time demands of AD.
We propose a Geometric-Semantic Dual-Branch Network (GSDBN) with a hybrid BEV-Voxel representation.
arXiv Detail & Related papers (2024-07-18T04:46:13Z) - Probabilistic and Geometric Depth: Detecting Objects in Perspective [78.00922683083776]
3D object detection is an important capability needed in various practical applications such as driver assistance systems.
Monocular 3D detection, as an economical solution compared to conventional settings relying on binocular vision or LiDAR, has drawn increasing attention recently but still yields unsatisfactory results.
This paper first presents a systematic study on this problem and observes that the current monocular 3D detection problem can be simplified as an instance depth estimation problem.
arXiv Detail & Related papers (2021-07-29T16:30:33Z) - CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth [83.77839773394106]
We present a lightweight, tightly-coupled deep depth network and visual-inertial odometry system.
We provide the network with previously marginalized sparse features from VIO to increase the accuracy of initial depth prediction.
We show that it can run in real-time with single-thread execution while utilizing GPU acceleration only for the network and code Jacobian.
arXiv Detail & Related papers (2020-12-18T09:42:54Z) - 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [50.73148041205675]
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation.
We propose to devise a new geometry-based strategy to embed depth information with low-resolution voxel representation.
Our proposed geometric embedding works better than the depth feature learning from habitual SSC frameworks.
arXiv Detail & Related papers (2020-03-31T09:33:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.