Related papers: FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

URL: http://arxiv.org/abs/2509.02560v1
Date: Tue, 02 Sep 2025 17:54:21 GMT
Title: FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
Authors: You Shen, Zhipeng Zhang, Yansong Qu, Liujuan Cao,
Abstract summary: VGGT is a state-of-the-art feed-forward visual geometry model.<n>We propose FastVGGT, which leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT.<n>With 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.
Score: 45.31920631559476
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. we devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. Code is available at: https://mystorm16.github.io/fastvggt/.

Related papers

Sparse View Distractor-Free Gaussian Splatting [31.812029183156245]
3D Gaussian Splatting (3DGS) enables efficient training and fast novel view in static environments.<n>We propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information.
arXiv Detail & Related papers (2026-03-02T08:32:32Z)
Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment [15.822150318879052]
We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment.<n>We train a lightweight feature adapter using a reprojection-based consistency loss.<n>This enables state-of-the-art performance in both NVS and camera pose estimation.
arXiv Detail & Related papers (2025-12-09T18:59:52Z)
Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image [68.55613894952177]
We introduce textbfWonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images.<n>We propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images.<n> Lastly, we introduce a cascaded 3D mesh extraction algorithm that drives high-quality surfaces from the multi-view 2D representations in only about $3$ minute in a coarse-to-fine manner.
arXiv Detail & Related papers (2025-11-03T17:24:18Z)
IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions.<n>We propose InstanceGrounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z)
VGGT-X: When VGGT Meets Dense Novel View Synthesis [27.397168758449904]
We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS)<n>Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade 3D training.<n>We introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices.
arXiv Detail & Related papers (2025-09-29T17:59:59Z)
Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection [62.27919334393825]
We propose a novel Geometry-aware Temporal Aggregation Network (GTA-Net) for monocular 3D lane detection.<n>On one hand, we develop the Temporal Geometry Enhancement Module (TGEM), which exploits geometric consistency across successive frames.<n>On the other hand, we present the Temporal Instance-aware Query Generation (TIQG), which strategically incorporates temporal cues into query generation.
arXiv Detail & Related papers (2025-04-29T08:10:17Z)
Interpretable Single-View 3D Gaussian Splatting using Unsupervised Hierarchical Disentangled Representation Learning [46.85417907244265]
We propose an interpretable single-view 3DGS framework, termed 3DisGS, to discover both coarse- and fine-grained 3D semantics.<n>Our model achieves 3D disentanglement while preserving high-quality and rapid reconstruction.
arXiv Detail & Related papers (2025-04-05T14:42:13Z)
EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis [61.1662426227688]
Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization.<n>We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner.
arXiv Detail & Related papers (2025-03-26T02:47:27Z)
VGGT: Visual Geometry Grounded Transformer [61.37669770946458]
VGGT is a feed-forward neural network that directly infers all key 3D attributes of a scene.<n>Network achieves state-of-the-art results in multiple 3D tasks.
arXiv Detail & Related papers (2025-03-14T17:59:47Z)
GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation [65.33726478659304]
We introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach which can predict high-quality assets with 512k Gaussians and 21 input images in only 11 GB GPU memory. Previous works neglect the inherent sparsity of 3D structure and do not utilize explicit geometric relationships between 3D and 2D images. GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms.
arXiv Detail & Related papers (2024-06-21T17:49:31Z)
FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View [46.81548000021799]
In autonomous driving, 3D occupancy prediction outputs voxel-wise status and semantic labels for more comprehensive understandings of 3D scenes. Recent researchers have extensively explored various aspects of this task, including view transformation techniques, ground-truth label generation, and elaborate network design. A new method, dubbed FastOcc, is proposed to accelerate the model while keeping its accuracy. Experiments on the Occ3D-nuScenes benchmark demonstrate that our FastOcc achieves a fast inference speed.
arXiv Detail & Related papers (2024-03-05T07:01:53Z)
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images. We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image. We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
CVFNet: Real-time 3D Object Detection by Learning Cross View Features [11.402076835949824]
We present a real-time view-based single stage 3D object detector, namely CVFNet. We first propose a novel Point-Range feature fusion module that deeply integrates point and range view features in multiple stages. Then, a special Slice Pillar is designed to well maintain the 3D geometry when transforming the obtained deep point-view features into bird's eye view.
arXiv Detail & Related papers (2022-03-13T06:23:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.