VG3T: Visual Geometry Grounded Gaussian Transformer
- URL: http://arxiv.org/abs/2512.05988v1
- Date: Fri, 28 Nov 2025 07:27:20 GMT
- Title: VG3T: Visual Geometry Grounded Gaussian Transformer
- Authors: Junho Kim, Seongwon Lee
- Abstract summary: VG3T is a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. We show a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark.
- Score: 18.15986152198467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. Our VG3T shows a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark, highlighting its superior efficiency and performance.
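The abstract attributes pixel-aligned Gaussian initialization with a distance-dependent density bias, which Grid-Based Sampling is meant to mitigate. The sketch below is a minimal numerical illustration of that bias, not VG3T's actual code: unprojecting one Gaussian per pixel crowds primitives near the camera, while a uniform 3D lattice keeps the density even. All sizes, names, and the depth model are assumptions.

```python
import numpy as np

# Illustrative sketch (not VG3T's implementation) of the distance-dependent
# density bias in pixel-aligned Gaussian initialization.
np.random.seed(0)
H, W, f = 64, 64, 64.0  # image resolution and focal length (arbitrary)

# Pixel-aligned init: one Gaussian per pixel, unprojected along its ray.
us, vs = np.meshgrid(np.arange(W) - W / 2, np.arange(H) - H / 2)
depth = np.random.uniform(1.0, 10.0, size=(H, W))  # stand-in depth map
rays = np.stack([us / f, vs / f, np.ones((H, W))], axis=-1)
pixel_aligned = (rays * depth[..., None]).reshape(-1, 3)

# Grid-based sampling: centers on a uniform 3D lattice over the scene volume.
xy = np.linspace(-5.0, 5.0, 16)
gx, gy, gz = np.meshgrid(xy, xy, np.linspace(1.0, 10.0, 16))
grid_based = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

def count_in_box(pts, z_lo, z_hi):
    """Count Gaussians inside a fixed 1x1 column between depths z_lo, z_hi."""
    m = (np.abs(pts[:, 0]) < 0.5) & (np.abs(pts[:, 1]) < 0.5) \
        & (pts[:, 2] >= z_lo) & (pts[:, 2] < z_hi)
    return int(m.sum())

# Pixel-aligned primitives crowd the near field; the lattice stays uniform.
near_pa = count_in_box(pixel_aligned, 1, 2)
far_pa = count_in_box(pixel_aligned, 9, 10)
near_gb = count_in_box(grid_based, 1, 2)
far_gb = count_in_box(grid_based, 9, 10)
```

Because camera rays diverge, the same solid angle covers more volume at distance, so per-pixel unprojection yields far fewer primitives per unit volume in the far field; the lattice counts barely change.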
Related papers
- iGaussian: Real-Time Camera Pose Estimation via Feed-Forward 3D Gaussian Splatting Inversion [62.09575122593993]
iGaussian is a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods.
arXiv Detail & Related papers (2025-11-18T05:22:22Z) - Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image [68.55613894952177]
We introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. We propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about 3 minutes in a coarse-to-fine manner.
arXiv Detail & Related papers (2025-11-03T17:24:18Z) - Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction [30.518107360632488]
Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation. The method provides an efficient, scalable solution for real-world 3D content generation.
arXiv Detail & Related papers (2025-07-20T11:33:13Z) - OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View [74.58230239274123]
We propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two views captured directly from a smartphone camera.
arXiv Detail & Related papers (2025-06-05T16:17:18Z) - RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images [39.03889696169877]
RoGSplat is a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images. Our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization.
arXiv Detail & Related papers (2025-03-18T12:18:34Z) - F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Aggregative Gaussian Splatting [35.625593119642424]
This paper tackles the problem of generalizable 3D-aware generation from monocular datasets. We propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting. We also introduce a self-supervised cycle-aggregative constraint to enforce cross-view consistency in the learned 3D representation.
arXiv Detail & Related papers (2025-01-12T04:44:44Z) - CrossView-GS: Cross-view Gaussian Splatting For Large-scale Scene Reconstruction [5.528874948395173]
We propose a novel cross-view Gaussian Splatting method for large-scale scene reconstruction based on multi-branch construction and fusion. Our method achieves superior performance in novel view synthesis compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-01-03T08:24:59Z) - NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model [57.92709692193132]
NovelGS is a diffusion model for Gaussian Splatting given sparse-view images.
We leverage novel-view denoising through a transformer-based network to generate 3D Gaussians.
arXiv Detail & Related papers (2024-11-25T07:57:17Z) - UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images [20.089890859122168]
We introduce UniGS, a novel 3D Gaussian reconstruction and novel view synthesis model. UniGS predicts a high-fidelity representation of 3D Gaussians from an arbitrary number of posed sparse-view images.
arXiv Detail & Related papers (2024-10-17T03:48:02Z) - MCGS: Multiview Consistency Enhancement for Sparse-View 3D Gaussian Radiance Fields [100.90743697473232]
Radiance fields represented by 3D Gaussians excel at synthesizing novel views, offering both high training efficiency and fast rendering. Existing methods often incorporate depth priors from dense estimation networks but overlook the inherent multi-view consistency in input images. We propose a view synthesis framework based on 3D Gaussian Splatting, enabling scene reconstruction from sparse views.
arXiv Detail & Related papers (2024-10-15T08:39:05Z) - AugGS: Self-augmented Gaussians with Structural Masks for Sparse-view 3D Reconstruction [9.953394373473621]
Sparse-view 3D reconstruction is a major challenge in computer vision. We propose a self-augmented two-stage Gaussian splatting framework enhanced with structural masks for sparse-view 3D reconstruction. Our approach achieves state-of-the-art performance in perceptual quality and multi-view consistency with sparse inputs.
arXiv Detail & Related papers (2024-08-09T03:09:22Z) - MVGamba: Unify 3D Content Generation as State Space Sequence Modeling [150.80564081817786]
We introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with only approximately 0.1x of the model size.
arXiv Detail & Related papers (2024-06-10T15:26:48Z) - MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images [102.7646120414055]
We introduce MVSplat, an efficient model that, given sparse multi-view images as input, predicts clean feed-forward 3D Gaussians.
On the large-scale RealEstate10K and ACID benchmarks, MVSplat achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps).
arXiv Detail & Related papers (2024-03-21T17:59:58Z) - GeoGS3D: Single-view 3D Reconstruction via Geometric-aware Diffusion Model and Gaussian Splatting [81.03553265684184]
We introduce GeoGS3D, a framework for reconstructing detailed 3D objects from single-view images.
We propose a novel metric, Gaussian Divergence Significance (GDS), to prune unnecessary operations during optimization.
Experiments demonstrate that GeoGS3D generates images with high consistency across views and reconstructs high-quality 3D objects.
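GeoGS3D's summary mentions pruning unnecessary operations during optimization via a significance metric (GDS). A minimal sketch of significance-based pruning follows; the scoring rule here (opacity times volume) is a hypothetical stand-in, not the paper's actual GDS definition, and all names are assumptions.

```python
import numpy as np

# Hedged sketch of significance-based Gaussian pruning, in the spirit of
# GeoGS3D's GDS metric. The score below is an assumed proxy, not the
# paper's formula.
def prune_gaussians(opacity, scales, keep_ratio=0.5):
    """Keep the top-`keep_ratio` fraction of Gaussians by a crude
    significance score; returns the indices of the survivors."""
    volume = np.prod(scales, axis=1)   # product of per-axis scales
    score = opacity * volume           # assumed significance proxy
    k = max(1, int(len(score) * keep_ratio))
    return np.argsort(score)[-k:]      # indices of the k highest scores

# Toy example: four Gaussians with varying opacity and extent.
opacity = np.array([0.9, 0.1, 0.8, 0.05])
scales = np.array([[1, 1, 1], [2, 2, 2], [1, 1, 2], [0.1, 0.1, 0.1]],
                  dtype=float)
survivors = prune_gaussians(opacity, scales, keep_ratio=0.5)
```

The design point such metrics share is that pruning decisions are made per primitive from cheap local statistics, so the optimization loop can drop low-impact Gaussians without rendering every candidate.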
arXiv Detail & Related papers (2024-03-15T12:24:36Z) - Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting [9.383423119196408]
We introduce Multi-view ControlNet (MVControl), a novel neural network architecture designed to enhance existing multi-view diffusion models. MVControl is able to offer 3D diffusion guidance for optimization-based 3D generation. In pursuit of efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations.
arXiv Detail & Related papers (2024-03-15T02:57:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.