On Geometric Understanding and Learned Data Priors in VGGT
- URL: http://arxiv.org/abs/2512.11508v1
- Date: Fri, 12 Dec 2025 12:11:57 GMT
- Title: On Geometric Understanding and Learned Data Priors in VGGT
- Authors: Jelena Bratulić, Sudhanshu Mittal, Thomas Brox, Christian Rupprecht,
- Abstract summary: The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass.<n>We conduct a systematic analysis of VGGT's internal mechanisms to uncover whether geometric understanding emerges within its representations.
- Score: 38.8968170074396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass. Trained in a supervised, single-step fashion on large datasets, VGGT raises a key question: does it build upon geometric concepts like traditional multi-view methods, or does it rely primarily on learned appearance-based data-driven priors? In this work, we conduct a systematic analysis of VGGT's internal mechanisms to uncover whether geometric understanding emerges within its representations. By probing intermediate features, analyzing attention patterns, and performing interventions, we examine how the model implements its functionality. Our findings reveal that VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. We further investigate VGGT's dependence on its learned data priors. Using spatial input masking and perturbation experiments, we assess its robustness to occlusions, appearance variations, and camera configurations, comparing it with classical multi-stage pipelines. Together, these insights highlight how VGGT internalizes geometric structure while using learned data-driven priors.
Related papers
- VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection [36.17507198972377]
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain.<n>We present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection.
arXiv Detail & Related papers (2026-03-01T04:25:52Z) - GPA-VGGT:Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss [15.633839321933385]
Recent advancements in Visual Geometry Grounded Transformer (VGGT) models have shown great promise in camera pose estimation and 3D reconstruction.<n>These models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes.<n>We propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments.
arXiv Detail & Related papers (2026-01-23T16:46:59Z) - DVGT: Driving Visual Geometry Transformer [63.38483879291505]
A driving-targeted dense geometry perception model can adapt to different scenarios and camera configurations.<n>We propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs.<n>DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations.
arXiv Detail & Related papers (2025-12-18T18:59:57Z) - Dense Semantic Matching with VGGT Prior [49.42199006453071]
We propose an approach that retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences.<n>Our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
arXiv Detail & Related papers (2025-09-25T14:56:11Z) - FastVGGT: Training-Free Acceleration of Visual Geometry Transformer [83.67766078575782]
VGGT is a state-of-the-art feed-forward visual geometry model.<n>We propose FastVGGT, which leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT.<n>With 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.
arXiv Detail & Related papers (2025-09-02T17:54:21Z) - Analyzing Deep Transformer Models for Time Series Forecasting via Manifold Learning [4.910937238451485]
Transformer models have consistently achieved remarkable results in various domains such as natural language processing and computer vision.
Despite ongoing research efforts to better understand these models, the field still lacks a comprehensive understanding.
Time series data, unlike image and text information, can be more challenging to interpret and analyze.
arXiv Detail & Related papers (2024-10-17T17:32:35Z) - The geometry of hidden representations of large transformer models [43.16765170255552]
Large transformers are powerful architectures used for self-supervised data analysis across various data types.
We show that the semantic structure of the dataset emerges from a sequence of transformations between one representation and the next.
We show that the semantic information of the dataset is better expressed at the end of the first peak, and this phenomenon can be observed across many models trained on diverse datasets.
arXiv Detail & Related papers (2023-02-01T07:50:26Z) - Self-supervised Geometric Perception [96.89966337518854]
Self-supervised geometric perception is a framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels.
We show that SGP achieves state-of-the-art performance that is on-par or superior to the supervised oracles trained using ground-truth labels.
arXiv Detail & Related papers (2021-03-04T15:34:43Z) - Graph Signal Processing for Geometric Data and Beyond: Theory and
Applications [55.81966207837108]
Graph Signal Processing (GSP) enables processing signals that reside on irregular domains.
GSP methodologies for geometric data in a unified manner by bridging the connections between geometric data and graphs.
Recently developed Graph Neural Networks (GNNs) interpret the operation of these networks from the perspective of GSP.
arXiv Detail & Related papers (2020-08-05T03:20:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.