Related papers: GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection

URL: http://arxiv.org/abs/2512.23176v1
Date: Mon, 29 Dec 2025 03:34:39 GMT
Title: GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection
Authors: Yi Zhang, Yi Wang, Lei Yao, Lap-Pui Chau
Abstract summary: Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision. We present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning.
Score: 18.809986709717446
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).

Related papers

TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment [58.46706158310462]
TIGaussian harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment. Our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations. A text-3D projection module adaptively maps 3D features to text embedding space for better text-3D alignment.
arXiv Detail & Related papers (2026-01-27T06:30:32Z)
C3G: Learning Compact 3D Representations with 2K Gaussians [55.04010158339562]
Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations.
arXiv Detail & Related papers (2025-12-03T17:59:05Z)
GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering [50.675710727721786]
We propose GauSSmart, a hybrid method that bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting.
arXiv Detail & Related papers (2025-10-16T03:38:26Z)
Hi^2-GSLoc: Dual-Hierarchical Gaussian-Specific Visual Relocalization for Remote Sensing [6.997091164331322]
Visual relocalization is fundamental to remote sensing and UAV applications. Existing methods face inherent trade-offs: image-based retrieval and pose regression approaches lack precision. We introduce Hi^2-GSLoc, a dual-hierarchical relocalization framework that follows a sparse-to-dense and coarse-to-fine paradigm.
arXiv Detail & Related papers (2025-07-21T14:47:56Z)
ODG: Occupancy Prediction Using Dual Gaussians [38.9869091446875]
Occupancy prediction infers fine-grained 3D geometry and semantics from camera images of the surrounding environment. Existing methods either adopt dense grids as scene representation, or learn the entire scene using a single set of sparse queries. We present ODG, a hierarchical dual sparse Gaussian representation to effectively capture complex scene dynamics.
arXiv Detail & Related papers (2025-06-11T06:03:03Z)
GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding [20.578106363482018]
We propose a novel framework that enhances 3DGS-based scene understanding by integrating semantic clustering and scene graph generation. We introduce a "Control-Follow" clustering strategy, which dynamically adapts to scene scale and feature distribution, avoiding feature compression. We enrich scene representation by integrating object attributes and spatial relations extracted from 2D foundation models.
arXiv Detail & Related papers (2025-03-06T02:36:59Z)
GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views [67.34073368933814]
We propose a generalizable Gaussian Splatting approach for high-resolution image rendering under a sparse-view camera setting. We train our Gaussian parameter regression module on human-only data or human-scene data, jointly with a depth estimation module to lift 2D parameter maps to 3D space. Experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while also achieving very high rendering speed.
arXiv Detail & Related papers (2024-11-18T08:18:44Z)
GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation [65.33726478659304]
We introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach which can predict high-quality assets with 512k Gaussians and 21 input images in only 11 GB GPU memory. Previous works neglect the inherent sparsity of 3D structure and do not utilize explicit geometric relationships between 3D and 2D images. GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms.
arXiv Detail & Related papers (2024-06-21T17:49:31Z)
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting [27.974762304763694]
We introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features into a novel semantic component of 3D Gaussians. We build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference.
arXiv Detail & Related papers (2024-03-22T21:28:19Z)
GeoGS3D: Single-view 3D Reconstruction via Geometric-aware Diffusion Model and Gaussian Splatting [81.03553265684184]
We introduce GeoGS3D, a framework for reconstructing detailed 3D objects from single-view images. We propose a novel metric, Gaussian Divergence Significance (GDS), to prune unnecessary operations during optimization. Experiments demonstrate that GeoGS3D generates images with high consistency across views and reconstructs high-quality 3D objects.
arXiv Detail & Related papers (2024-03-15T12:24:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.