Déjà View: Looping Transformers for Multi-View 3D Reconstruction
Abstract Overview
The paper presents DéjàView, a multi-view 3D reconstruction model that replaces a deep feed-forward transformer with a single shared transformer block applied recurrently to per-view DINOv2 features. The recurrent block is conditioned on continuous time intervals, and the number of refinement steps K is sampled during training so that one checkpoint can be used at different inference-time compute budgets. The method predicts depth, rays, and camera parameters, and the authors analyze the recurrent dynamics as a form of directional refinement rather than fixed-point convergence. Evaluation is reported on five benchmarks spanning indoor, outdoor, object-centric, and driving scenes.
Novelty
The main novelty is making iterative refinement explicit in a multi-view reconstruction transformer by looping a shared block, instead of relying on many independently parameterized layers to realize refinement implicitly. The paper also introduces continuous time conditioning and variable-K training so the same trained model can trade compute for accuracy at inference, and shows that shared recurrence outperforms an otherwise matched untied per-step variant.
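The looping idea can be illustrated with a deliberately tiny, pure-Python sketch. It is not the authors' implementation: `shared_block` is a toy stand-in for the shared transformer block (a scalar residual update, not attention), the equal-width split of [0, 1] into K intervals is an assumption about how time intervals might be chosen, and the names `time_intervals`, `refine`, `sample_k`, `k_min`, and `k_max` are hypothetical. What it does show is the structure described above: one set of parameters reused at every step, each step conditioned on its time interval, and K drawn at random per training iteration.

```python
import math
import random


def time_intervals(k):
    """Split [0, 1] into k equal-width (t_start, t_end) intervals (an assumed schedule)."""
    return [(i / k, (i + 1) / k) for i in range(k)]


def shared_block(tokens, t_start, t_end, params):
    """Toy stand-in for the shared block: a residual update scaled by the interval width.

    `tokens` is a list of floats (one scalar per view, for illustration only).
    `params` is the single parameter tuple reused at every step.
    """
    w, b, g = params
    dt = t_end - t_start
    mean = sum(tokens) / len(tokens)  # crude substitute for cross-view mixing
    return [x + dt * math.tanh(w * x + b * mean + g * t_start) for x in tokens]


def refine(tokens, k, params):
    """Apply the same block k times, once per time interval."""
    for t_start, t_end in time_intervals(k):
        tokens = shared_block(tokens, t_start, t_end, params)
    return tokens


def sample_k(rng, k_min=4, k_max=16):
    """Draw the number of refinement steps for one training iteration (hypothetical range)."""
    return rng.randint(k_min, k_max)


if __name__ == "__main__":
    rng = random.Random(0)
    params = (0.5, 0.2, 0.1)     # one parameter set, shared across all steps
    tokens = [0.1, -0.2, 0.3]    # toy per-view features
    k_train = sample_k(rng)      # K varies from iteration to iteration
    print(k_train, refine(tokens, k_train, params))
    # At inference the same parameters can be run with a smaller or larger K.
    print(refine(tokens, 4, params))
    print(refine(tokens, 16, params))
```

Because the update is scaled by the interval width and the intervals always cover [0, 1], changing K changes how finely the same trajectory is traversed rather than how far the state travels, which is one plausible reading of why a single checkpoint could serve several compute budgets.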
Results
Across DTU, ETH3D, 7-Scenes, ScanNet++, and nuScenes, DéjàView matches or exceeds much larger feed-forward baselines while using 117M parameters, 75.9 TFLOPs, and 4.9 GiB peak memory at 24 views. In the efficiency summary, it attains the best average inlier ratio (80.3) and AUC@30 (91.8), and the pose table shows first- or second-place results on nine of ten benchmark cells. Ablations further show that weight sharing and the proposed gating design improve metrics monotonically, with the fully shared design outperforming a decoupled 16-step alternative despite far fewer parameters.
Key Points
- DéjàView formulates multi-view 3D reconstruction as recurrent refinement of per-view tokens with a single shared transformer block conditioned on continuous time intervals.
- A single variable-K checkpoint supports different inference budgets, and the model achieves strong cross-benchmark performance with substantially fewer parameters than prior feed-forward transformers.
- The analysis indicates monotonic quality improvement across recurrent steps and identifies directional refinement in feature space; moreover, the shared recurrent block performs better than a matched untied per-step design.
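The contrast between directional refinement and fixed-point convergence drawn in the last point can be made concrete with a simple trajectory diagnostic. The sketch below is an assumption about how such a distinction could be probed, not the paper's analysis protocol: given a sequence of feature states, it reports the norm of each update and the cosine similarity between consecutive updates. The function name `trajectory_diagnostics` and the toy trajectories are hypothetical.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _cosine(a, b):
    na, nb = _norm(a), _norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0  # direction is undefined for a zero update
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def trajectory_diagnostics(states):
    """Return (step_norms, direction_cosines) for a list of feature vectors.

    step_norms[i] is the size of the update from state i to state i + 1.
    direction_cosines[i] compares the direction of update i with update i + 1.
    """
    updates = [_sub(states[i + 1], states[i]) for i in range(len(states) - 1)]
    step_norms = [_norm(u) for u in updates]
    direction_cosines = [
        _cosine(updates[i], updates[i + 1]) for i in range(len(updates) - 1)
    ]
    return step_norms, direction_cosines


if __name__ == "__main__":
    # Updates keep their size and direction: movement along a consistent direction.
    steady = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    # Updates shrink geometrically: the state is settling toward a fixed point.
    settling = [[0.0, 0.0], [1.0, 0.0], [1.5, 0.0], [1.75, 0.0]]
    print(trajectory_diagnostics(steady))
    print(trajectory_diagnostics(settling))
```

Under this reading, step norms that decay toward zero would point to fixed-point behaviour, while updates that stay well aligned without vanishing would be more consistent with refinement along a direction. In the two toy trajectories the direction is identical, so only the step norms tell them apart.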