FuguReport

Déjà View: Looping Transformers for Multi-View 3D Reconstruction

Authors Alessandro Burzio, Tobias Fischer, Sven Elflein, Qunjie Zhou, Riccardo de Lutio, Jiawei Ren, Jiahui Huang, Shengyu Huang, Marc Pollefeys, Laura Leal-Taixé, Zan Gojcic, Haithem Turki
Affiliations NVIDIA / University of Modena and Reggio Emilia / ETH Zurich / University of Toronto
Categories Method / 3D Reconstruction / Multi-view iterative feature application, Application / Computer Vision / 3D reconstruction from multiple views, Evaluation / Model Efficiency / Efficiency of looping transformer architecture
License CC BY 4.0

Abstract Overview

The paper presents DéjàView, a multi-view 3D reconstruction model that replaces a deep feed-forward transformer with a single shared transformer block applied recurrently to per-view DINOv2 features. The recurrent block is conditioned on continuous time intervals, and the number of refinement steps K is sampled during training so that one checkpoint can be used at different inference-time compute budgets. The method predicts depth, rays, and camera parameters, and the authors analyze the recurrent dynamics as a form of directional refinement rather than fixed-point convergence. Evaluation is reported on five benchmarks spanning indoor, outdoor, object-centric, and driving scenes.
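The looping scheme described above can be illustrated with a toy sketch. It is not the authors' implementation: the block below is a scalar stand-in for a transformer block, and the names `shared_block`, `refine`, and `sample_num_steps`, the way the time interval scales the update, and the sampling range for K are all assumptions made for illustration.

```python
import math
import random

def shared_block(tokens, weight, t_start, t_end):
    """One application of the single shared block (toy stand-in for a transformer block).

    The interval (t_start, t_end) is the conditioning signal. Here it only scales
    the residual update, which is an assumption made for illustration.
    """
    dt = t_end - t_start
    mean = sum(tokens) / len(tokens)  # crude stand-in for mixing information across views
    return [x + dt * math.tanh(weight * (mean - x)) for x in tokens]

def refine(tokens, weight, num_steps):
    """Apply the same block num_steps times over equal sub-intervals of [0, 1]."""
    for k in range(num_steps):
        tokens = shared_block(tokens, weight, k / num_steps, (k + 1) / num_steps)
    return tokens

def sample_num_steps(rng, k_min=1, k_max=16):
    """Draw K for one training iteration. The range and uniform choice are hypothetical."""
    return rng.randint(k_min, k_max)

if __name__ == "__main__":
    rng = random.Random(0)
    features = [0.0, 1.0, 2.0]  # placeholder for per-view features
    # The same weight serves every budget; only the number of loop iterations changes.
    for budget in (2, 8, sample_num_steps(rng)):
        print(budget, refine(features, 1.0, budget))
```

Because the block receives the interval it covers, the same weights can be reused whether the unit interval is split into 2 or 8 steps, which is the property that lets one checkpoint serve several compute budgets.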

Novelty

The main novelty is making iterative refinement explicit in a multi-view reconstruction transformer by looping a shared block, instead of relying on many independently parameterized layers to realize refinement implicitly. The paper also introduces continuous time conditioning and variable-K training so the same trained model can trade compute for accuracy at inference, and shows that shared recurrence outperforms an otherwise matched untied per-step variant.

Results

Across DTU, ETH3D, 7-Scenes, ScanNet++, and nuScenes, DéjàView matches or exceeds much larger feed-forward baselines while using 117M parameters, 75.9 TFLOPs, and 4.9 GiB peak memory at 24 views. In the efficiency summary, it attains the best average inlier ratio (80.3) and AUC@30 (91.8), and the pose table shows first- or second-place results on nine of ten benchmark cells. Ablations further show that weight sharing and the proposed gating design improve metrics monotonically, with the fully shared design outperforming a decoupled 16-step alternative despite far fewer parameters.
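The parameter gap between the shared and decoupled designs can be made concrete with a back-of-the-envelope count. The sketch below uses the standard weight count for a transformer block (attention plus MLP, ignoring biases and normalization). The width of 768 and the MLP ratio of 4 are hypothetical and are not taken from the paper; only the 16-step count comes from the ablation mentioned above.

```python
def block_params(width, mlp_ratio=4):
    """Approximate weights in one standard transformer block (biases and norms ignored)."""
    attention = 4 * width * width          # Q, K, V and output projections
    mlp = 2 * width * (mlp_ratio * width)  # up- and down-projection
    return attention + mlp

def refinement_params(width, num_steps, shared):
    """Weights held by the refinement stage: one block if shared, one per step if untied."""
    per_block = block_params(width)
    return per_block if shared else per_block * num_steps

WIDTH = 768  # hypothetical token width, not taken from the paper
STEPS = 16   # step count of the decoupled variant in the ablation

tied = refinement_params(WIDTH, STEPS, shared=True)
untied = refinement_params(WIDTH, STEPS, shared=False)
print(f"shared: {tied:,} weights, untied: {untied:,} weights, ratio: {untied // tied}x")
```

Under these assumptions the untied variant stores 16 times as many refinement weights as the shared one, while both perform the same number of block applications per forward pass.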

Key Points

  1. DéjàView formulates multi-view 3D reconstruction as recurrent refinement of per-view tokens with a single shared transformer block conditioned on continuous time intervals.
  2. A single variable-K checkpoint supports different inference budgets, and the model achieves strong cross-benchmark performance with substantially fewer parameters than prior feed-forward transformers.
  3. The analysis indicates that quality improves monotonically across recurrent steps and characterizes the dynamics as directional refinement in feature space rather than fixed-point convergence. The shared recurrent block also outperforms a matched untied per-step design.
