VGGT-X: When VGGT Meets Dense Novel View Synthesis
- URL: http://arxiv.org/abs/2509.25191v2
- Date: Wed, 08 Oct 2025 06:29:47 GMT
- Title: VGGT-X: When VGGT Meets Dense Novel View Synthesis
- Authors: Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang,
- Abstract summary: We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade 3D training. We introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices.
- Score: 27.397168758449904
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders-of-magnitude speedups over the traditional pipeline and great potential for online NVS, but most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/
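The abstract does not spell out how the memory-efficient VGGT implementation works, so the snippet below is only a minimal sketch of two standard measures that keep a global-attention model within VRAM at 1,000+ frames: query-chunked attention and gradient-free, mixed-precision inference. The function name, shapes, and chunk size are illustrative assumptions, not the actual VGGT-X code.

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Scaled dot-product attention computed over query chunks.

    Only a (chunk_size x N) slice of the attention matrix is materialized at a
    time, bounding peak memory when N spans tokens from 1,000+ frames.
    q, k, v: (..., N, D) tensors.
    """
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[-2], chunk_size):
        stop = min(start + chunk_size, q.shape[-2])
        attn = torch.softmax((q[..., start:stop, :] @ k.transpose(-2, -1)) * scale, dim=-1)
        out[..., start:stop, :] = attn @ v
    return out

# Gradient-free, mixed-precision inference further caps VRAM during prediction.
device = "cuda" if torch.cuda.is_available() else "cpu"
q = k = v = torch.randn(1, 4, 4096, 64, device=device)
with torch.no_grad(), torch.autocast(device, dtype=torch.bfloat16, enabled=device == "cuda"):
    out = chunked_attention(q, k, v)
```

The chunk size trades a small amount of speed for a hard bound on peak memory; fused attention kernels or CPU offloading would compose with the same idea.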
Related papers
- Sparse View Distractor-Free Gaussian Splatting [31.812029183156245]
3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments.
We propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information.
arXiv Detail & Related papers (2026-03-02T08:32:32Z)
- Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment [15.822150318879052]
We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment.
We train a lightweight feature adapter using a reprojection-based consistency loss (an illustrative sketch of such a loss appears after this list).
This enables state-of-the-art performance in both NVS and camera pose estimation.
arXiv Detail & Related papers (2025-12-09T18:59:52Z)
- DWGS: Enhancing Sparse-View Gaussian Splatting with Hybrid-Loss Depth Estimation and Bidirectional Warping [8.67235980460198]
Novel View Synthesis from sparse views remains a core challenge in 3D reconstruction.
We propose DWGS, a novel unified framework that enhances 3DGS for sparse-view synthesis.
We show that DWGS sets a new state of the art, achieving up to 21.13 dB PSNR and 0.189 LPIPS, while retaining real-time inference capabilities.
arXiv Detail & Related papers (2025-09-29T15:03:31Z)
- FastVGGT: Training-Free Acceleration of Visual Geometry Transformer [45.31920631559476]
VGGT is a state-of-the-art feed-forward visual geometry model.
We propose FastVGGT, which leverages token merging in the 3D domain as a training-free mechanism for accelerating VGGT (a generic token-merging sketch appears after this list).
With 1,000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.
arXiv Detail & Related papers (2025-09-02T17:54:21Z)
- Fast Learning of Non-Cooperative Spacecraft 3D Models through Primitive Initialization [3.686808512438363]
This work contributes a Convolutional Neural Network (CNN)-based primitive initializer for 3DGS using monocular images.
A CNN takes a single image as input and outputs a coarse 3D model represented as an assembly of primitives, along with the target's pose relative to the camera.
This work compares these variants, evaluating their effectiveness for downstream 3DGS training under noisy or implicit pose estimates.
arXiv Detail & Related papers (2025-07-25T17:43:29Z)
- SparSplat: Fast Multi-View Reconstruction with Generalizable 2D Gaussian Splatting [7.9061560322289335]
We propose an MVS-based learning framework that regresses 2DGS surface parameters in a feed-forward fashion to perform 3D shape reconstruction and NVS from sparse-view images.
The resulting pipeline attains state-of-the-art results on the DTU 3D reconstruction benchmark in terms of Chamfer distance to ground truth, as well as state-of-the-art NVS.
arXiv Detail & Related papers (2025-05-04T16:33:47Z)
- EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis [61.1662426227688]
Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization.
We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner.
arXiv Detail & Related papers (2025-03-26T02:47:27Z)
- Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels.
FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z)
- PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting [54.7468067660037]
PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices.
Our framework capitalizes on the fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS.
arXiv Detail & Related papers (2024-10-29T15:28:15Z)
- MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views [27.47491233656671]
Novel View Synthesis (NVS) is a significant challenge in 3D vision applications.
We propose MVPGS, a few-shot NVS method that excavates multi-view priors based on 3D Gaussian Splatting.
Experiments show that the proposed method achieves state-of-the-art performance with real-time rendering speed.
arXiv Detail & Related papers (2024-09-22T05:07:20Z)
- DOGS: Distributed-Oriented Gaussian Splatting for Large-Scale 3D Reconstruction Via Gaussian Consensus [56.45194233357833]
We propose DoGaussian, a method that trains 3DGS in a distributed manner.
Our method accelerates the training of 3DGS by 6+ times when evaluated on large-scale scenes.
arXiv Detail & Related papers (2024-05-22T19:17:58Z)
- SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior [53.52396082006044]
Current methods struggle to maintain rendering quality at viewpoints that deviate significantly from the training viewpoints.
This issue stems from the sparse training views captured by a fixed camera on a moving vehicle.
We propose a novel approach that enhances the capacity of 3DGS by leveraging a prior from a Diffusion Model.
arXiv Detail & Related papers (2024-03-29T09:20:29Z)
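The Selfi entry above mentions training a lightweight feature adapter with a reprojection-based consistency loss. The paper's exact formulation is not given here, so the following is only a rough illustration of what such a loss can look like when per-view depth and relative poses are available: features from view j are warped into view i via reprojection and compared to view i's features. All names, arguments, and the masking scheme are hypothetical placeholders, not Selfi's actual loss.

```python
import torch
import torch.nn.functional as F

def reprojection_consistency_loss(feat_i, feat_j, depth_i, K, T_j_from_i):
    """Illustrative reprojection-based feature consistency loss.

    feat_i, feat_j : (C, H, W) adapter feature maps for views i and j
    depth_i        : (H, W) depth map for view i
    K              : (3, 3) shared camera intrinsics
    T_j_from_i     : (4, 4) rigid transform from view-i camera to view-j camera
    """
    C, H, W = feat_i.shape
    # Pixel grid of view i, unprojected to 3D points in camera-i coordinates.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().reshape(3, -1)   # (3, H*W)
    pts_i = torch.linalg.inv(K) @ pix * depth_i.reshape(1, -1)
    # Transform into camera-j coordinates and project back to pixels.
    pts_j = T_j_from_i[:3, :3] @ pts_i + T_j_from_i[:3, 3:4]
    proj = K @ pts_j
    uv = proj[:2] / proj[2:3].clamp(min=1e-6)
    # Normalize to [-1, 1] and sample view-j features at the reprojected locations.
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.reshape(1, H, W, 2)
    feat_j_warped = F.grid_sample(feat_j[None], grid, align_corners=True)[0]
    # Ignore pixels that land outside view j or behind its camera.
    valid = (grid.abs().amax(dim=-1)[0] <= 1) & (pts_j[2].reshape(H, W) > 0)
    return ((feat_i - feat_j_warped) ** 2).mean(dim=0)[valid].mean()
```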
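The FastVGGT entry mentions training-free token merging in the 3D domain, but its exact merging scheme is not described above. As a stand-in, the sketch below shows generic ToMe-style bipartite soft matching (merge the r most redundant tokens of one partition into their nearest neighbours in the other); it is an assumption about the flavor of technique, not FastVGGT's actual algorithm.

```python
import torch

def bipartite_token_merge(tokens, r):
    """Merge r redundant tokens via bipartite soft matching (illustrative).

    tokens : (N, C) token features; r <= N // 2.
    Returns (N - r, C) tokens; original ordering is not preserved.
    """
    a, b = tokens[0::2], tokens[1::2]                        # split into two partitions
    an = torch.nn.functional.normalize(a, dim=-1)
    bn = torch.nn.functional.normalize(b, dim=-1)
    best_sim, best_dst = (an @ bn.t()).max(dim=-1)           # best partner in b for each a-token
    src = best_sim.topk(r).indices                           # the r most redundant a-tokens
    merged_b = b.clone()
    merged_b.index_add_(0, best_dst[src], a[src])            # fold sources into their partners
    counts = torch.ones(b.shape[0], 1, dtype=tokens.dtype)
    counts.index_add_(0, best_dst[src], torch.ones(r, 1, dtype=tokens.dtype))
    merged_b = merged_b / counts                             # average each merged group
    keep_a = torch.ones(a.shape[0], dtype=torch.bool)
    keep_a[src] = False
    return torch.cat([a[keep_a], merged_b], dim=0)

tokens = torch.randn(4096, 256)
reduced = bipartite_token_merge(tokens, r=1024)              # 3072 tokens remain
```

Fewer tokens entering global attention shrinks its quadratic cost, which is the usual motivation for applying merging to very long image sequences.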
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.