VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
- URL: http://arxiv.org/abs/2602.23361v1
- Date: Thu, 26 Feb 2026 18:59:33 GMT
- Title: VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
- Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
- Abstract summary: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds.
- Score: 44.72105958250334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, our point map reconstruction outperforms other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
Related papers
- ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training [100.29965188088966]
We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass. We demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
arXiv Detail & Related papers (2026-03-04T18:49:37Z)
- Continuous 3D Perception Model with Persistent State [111.83854602049222]
We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each.
arXiv Detail & Related papers (2025-01-21T18:59:23Z)
- 3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets [34.610546020800236]
3DMiner is a pipeline for mining 3D shapes from challenging datasets.
Our method is capable of producing significantly better results than state-of-the-art unsupervised 3D reconstruction techniques.
We show how 3DMiner can be applied to in-the-wild data by reconstructing shapes present in images from the LAION-5B dataset.
arXiv Detail & Related papers (2023-10-29T23:08:19Z)
- Visual Localization using Imperfect 3D Models from the Internet [54.731309449883284]
This paper studies how imperfections in 3D models affect localization accuracy.
We show that 3D models from the Internet show promise as an easy-to-obtain scene representation.
arXiv Detail & Related papers (2023-04-12T16:15:05Z)
- Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth [90.33296913575818]
In some video-based scenarios such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift residing in per-frame prediction may cause depth inconsistency.
We propose a locally weighted linear regression method to recover the scale and shift with very sparse anchor points.
Our method can boost the performance of existing state-of-the-art approaches by up to 50% over several zero-shot benchmarks.
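The scale-and-shift recovery described above reduces, in its simplest form, to a weighted least-squares fit against sparse anchor depths. The paper's method is locally weighted; the sketch below is a simplified global variant with hypothetical data, just to show the shape of the problem:

```python
import numpy as np

def recover_scale_shift(pred_depth, anchor_depth, weights=None):
    """Solve min over (s, t) of sum_i w_i * (s * pred_i + t - anchor_i)^2.

    pred_depth, anchor_depth: (N,) depths at N sparse anchor points.
    Returns the scale s and shift t aligning the prediction to the anchors."""
    if weights is None:
        weights = np.ones_like(pred_depth)
    sw = np.sqrt(weights)
    A = np.stack([pred_depth, np.ones_like(pred_depth)], axis=1)  # (N, 2)
    # Weighted least squares: scale rows of A and the target by sqrt(w_i).
    (s, t), *_ = np.linalg.lstsq(sw[:, None] * A, sw * anchor_depth, rcond=None)
    return s, t

# Synthetic check: metric depth relates to the prediction by s=2.0, t=0.5.
rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 10.0, size=8)   # relative depth at 8 sparse anchors
anchor = 2.0 * pred + 0.5               # "ground-truth" metric depth
s, t = recover_scale_shift(pred, anchor)
aligned = s * pred + t                  # per-frame prediction after alignment
```

Applying such a fit per frame (with locality-dependent weights, per the paper) removes the per-frame scale/shift ambiguity that causes depth inconsistency across a video.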
arXiv Detail & Related papers (2022-02-03T08:52:54Z)
- Detailed Facial Geometry Recovery from Multi-view Images by Learning an Implicit Function [12.522283941978722]
We propose a novel architecture to recover extremely detailed 3D faces in roughly 10 seconds.
By fitting a 3D morphable model from multi-view images, the features of multiple images are extracted and aggregated in the mesh-attached UV space.
Our method outperforms SOTA learning-based MVS in accuracy by a large margin on the FaceScape dataset.
arXiv Detail & Related papers (2022-01-04T07:24:58Z) - Soft Expectation and Deep Maximization for Image Feature Detection [68.8204255655161]
We propose SEDM, an iterative semi-supervised learning process that flips the question and first looks for repeatable 3D points, then trains a detector to localize them in image space.
Our results show that this new model trained using SEDM is able to better localize the underlying 3D points in a scene.
arXiv Detail & Related papers (2021-04-21T00:35:32Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.