ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
- URL: http://arxiv.org/abs/2603.04385v1
- Date: Wed, 04 Mar 2026 18:49:37 GMT
- Title: ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
- Authors: Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski,
- Abstract summary: We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass. We demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
- Score: 100.29965188088966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
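The linear-time, fixed-size-state idea behind test-time training (TTT) layers can be sketched as follows. This is a minimal illustration, not ZipMap's actual architecture: the dimensions, learning rate, and random projections are all assumptions. The hidden state is itself a small linear model, updated by one gradient step per token on a self-supervised reconstruction loss, so compute and memory stay constant per token regardless of how many frames are zipped in.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # feature dimension (illustrative)

# Fixed projections of a hypothetical TTT layer (random for this sketch).
theta_K = rng.standard_normal((d, d)) / np.sqrt(d)
theta_V = rng.standard_normal((d, d)) / np.sqrt(d)
theta_Q = rng.standard_normal((d, d)) / np.sqrt(d)

def ttt_forward(tokens, lr=0.1):
    """Process tokens sequentially. The hidden state W is a small linear
    model updated by one gradient step per token, so the per-token cost
    is O(1) and the whole pass is linear in the number of tokens."""
    W = np.zeros((d, d))  # compact hidden state, size independent of input length
    outputs = []
    for x in tokens:
        k, v, q = theta_K @ x, theta_V @ x, theta_Q @ x
        # Self-supervised loss ||W k - v||^2; its gradient step is the state update.
        err = W @ k - v
        W -= lr * np.outer(err, k)
        outputs.append(W @ q)  # read out with the updated state
    return np.stack(outputs), W

tokens = rng.standard_normal((100, d))
outs, W = ttt_forward(tokens)
print(outs.shape, W.shape)  # the state stays (d, d) however many tokens arrive
```

Contrast this with full self-attention, where every token attends to every other token, giving cost quadratic in the number of input views.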
Related papers
- VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale [44.72105958250334]
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry. VGG-T$^3$ (Visual Geometry Grounded Test-Time Training) scales linearly with the number of input views, similar to online models, and reconstructs a $1k$-image collection in just $54$ seconds.
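The KV-space bottleneck mentioned above can be made concrete with back-of-envelope arithmetic: an attention KV cache grows with the number of views, while a compact hidden state does not. All numbers below (tokens per view, model width, layer count) are illustrative assumptions, not VGG-T$^3$'s actual configuration.

```python
# Back-of-envelope: varying-length KV cache vs. a fixed-size hidden state.
tokens_per_view = 1024
d_model = 1024
n_layers = 24
bytes_per_val = 2  # fp16

def kv_cache_bytes(n_views):
    # Keys + values for every token of every view, at every layer:
    # grows linearly with n_views (and attention compute quadratically).
    return 2 * n_views * tokens_per_view * d_model * n_layers * bytes_per_val

def fixed_state_bytes():
    # A compact hidden state (e.g. one d_model x d_model matrix per layer)
    # whose size is independent of the number of views.
    return d_model * d_model * n_layers * bytes_per_val

for n in (10, 100, 1000):
    print(f"{n:5d} views: KV cache {kv_cache_bytes(n)/1e9:8.2f} GB, "
          f"fixed state {fixed_state_bytes()/1e9:.3f} GB")
```

Under these assumptions a 1000-view KV cache exceeds the memory of a single GPU, while the fixed state is unchanged from the 10-view case.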
arXiv Detail & Related papers (2026-02-26T18:59:33Z)
- Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video [3.065513003860787]
We present TemPoFast3D, a novel method that leverages the temporal coherency of human appearance to reduce redundant computation. Our approach is a "plug-and-play" solution that transforms pixel-aligned reconstruction networks to handle continuous video streams. Extensive experiments demonstrate that TemPoFast3D matches or exceeds state-of-the-art methods across standard metrics.
arXiv Detail & Related papers (2025-05-12T08:16:19Z)
- FlowR: Flowing from Sparse to Dense 3D Reconstructions [60.28571003356382]
We propose a flow matching model that learns a flow to connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass.
arXiv Detail & Related papers (2025-04-02T11:57:01Z)
- Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass [68.78222900840132]
We propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization of DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation.
arXiv Detail & Related papers (2025-01-23T18:59:55Z)
- Continuous 3D Perception Model with Persistent State [111.83854602049222]
We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each.
arXiv Detail & Related papers (2025-01-21T18:59:23Z)
- Splatter Image: Ultra-Fast Single-View 3D Reconstruction [67.96212093828179]
Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images.
We learn a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS.
On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works.
arXiv Detail & Related papers (2023-12-20T16:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.