Understanding Multi-View Transformers
- URL: http://arxiv.org/abs/2510.24907v1
- Date: Tue, 28 Oct 2025 19:19:35 GMT
- Title: Understanding Multi-View Transformers
- Authors: Michal Stary, Julien Gaubil, Ayush Tewari, Vincent Sitzmann
- Abstract summary: Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers.
- Score: 18.573401296925844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks and the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand.
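The probing approach described in the abstract can be sketched with a simple linear (ridge-regression) probe fit to per-layer residual-stream activations. The sketch below is a minimal illustration, not the paper's implementation: the DUSt3R model, activation-capturing hooks, and real data are omitted, and synthetic features stand in for the residual stream, with later layers given cleaner 3D structure so the probe error shrinks with depth.

```python
import numpy as np

def fit_linear_probe(feats, targets, reg=1e-3):
    """Fit a ridge-regression probe mapping features to 3D point targets."""
    X = np.concatenate([feats, np.ones((feats.shape[0], 1))], axis=1)  # bias column
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ targets)
    return W

def probe_error(feats, targets, W):
    """Mean Euclidean error of the probe's 3D predictions."""
    X = np.concatenate([feats, np.ones((feats.shape[0], 1))], axis=1)
    return float(np.mean(np.linalg.norm(X @ W - targets, axis=1)))

# Synthetic stand-in for residual activations collected after each block.
rng = np.random.default_rng(0)
n_tokens, d_model = 512, 64
points = rng.normal(size=(n_tokens, 3))        # ground-truth 3D points per token
proj = rng.normal(size=(3, d_model))           # fixed linear embedding of geometry
errors = []
for layer in range(4):
    # Noise shrinks with depth, mimicking a latent 3D state that sharpens per block.
    noise = rng.normal(scale=2.0 / (layer + 1), size=(n_tokens, d_model))
    feats = points @ proj + noise
    W = fit_linear_probe(feats, points)
    errors.append(probe_error(feats, points, W))
# Probe error per layer should decrease as the 3D representation becomes more linearly decodable.
```

In an actual analysis, `feats` would come from forward hooks on the transformer's residual connections, and the per-layer error curve is what reveals where 3D structure emerges.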
Related papers
- STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer [72.88105562624838]
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.
arXiv Detail & Related papers (2025-08-14T17:58:05Z) - View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection [18.756000520353187]
View transformation robustness (VTR) is critical for deep-learning-based multi-view 3D object reconstruction models. We propose a reconstruction error-guided view selection method, which considers the spatial distribution of reconstruction errors in the 3D predictions. Experiments demonstrate that the proposed method can outperform state-of-the-art 3D reconstruction methods.
arXiv Detail & Related papers (2024-12-16T03:54:08Z) - GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z) - MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z) - 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer-based methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z) - 3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction [14.89364490991374]
This paper proposes a new model, namely 3D coarse-to-fine transformer (3D-C2FT), for encoding multi-view features and rectifying defective 3D objects.
The C2F attention mechanism enables the model to learn multi-view information flow and synthesize 3D surface corrections in a coarse-to-fine manner. Experimental results show that 3D-C2FT achieves notable results and outperforms several competing models on the evaluated datasets.
arXiv Detail & Related papers (2022-05-29T06:01:42Z) - Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [95.56457218144983]
The intuition of this work is to perceive the geometric inconsistencies between the given meshes using the powerful self-attention mechanism. We propose a novel geometry-contrastive Transformer that can efficiently perceive global geometric inconsistencies in 3D structure.
We present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task.
arXiv Detail & Related papers (2021-12-14T13:14:24Z) - Multi-view 3D Reconstruction with Transformer [34.756336770583154]
We reformulate the multi-view 3D reconstruction as a sequence-to-sequence prediction problem.
We propose a new framework named 3D Volume Transformer (VolT) for such a task.
Our method achieves a new state-of-the-art accuracy in multi-view reconstruction with fewer parameters.
arXiv Detail & Related papers (2021-03-24T03:14:49Z) - Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D structure of the object by decoding the 3D transformation parameters from the fused features of the multiple views before and after the transformation.
arXiv Detail & Related papers (2021-03-01T06:24:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.