MuM: Multi-View Masked Image Modeling for 3D Vision
- URL: http://arxiv.org/abs/2511.17309v1
- Date: Fri, 21 Nov 2025 15:25:47 GMT
- Title: MuM: Multi-View Masked Image Modeling for 3D Vision
- Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
- Abstract summary: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. In this work, we focus on learning features tailored for 3D vision. We extend MAE to arbitrarily many views of the same scene and employ a lightweight decoder with inter-frame attention. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation.
- Score: 29.044546222577804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
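A rough sketch of the recipe the abstract describes: uniformly mask every view, encode the visible patches per view, then let a lightweight decoder attend across all frames jointly. Everything below (module sizes, masking ratio, pixel-reconstruction head) is an illustrative PyTorch assumption, not the authors' released code.

```python
# Minimal sketch of multi-view masked autoencoding in the spirit of MuM.
# All shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class MultiViewMAE(nn.Module):
    def __init__(self, dim=256, patch=16, img=224, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.n_tokens = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)           # patchify
        self.pos = nn.Parameter(torch.zeros(1, self.n_tokens, dim))
        enc = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, 6)                  # per-view ViT encoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, 2)                  # lightweight, joint over views
        self.head = nn.Linear(dim, patch * patch * 3)                 # pixel reconstruction

    def forward(self, views):                        # views: (B, V, 3, H, W)
        B, V = views.shape[:2]
        x = self.embed(views.flatten(0, 1)).flatten(2).transpose(1, 2) + self.pos
        # Uniform random masking: the same ratio applied in every view.
        n_keep = int(self.n_tokens * (1 - self.mask_ratio))
        ids = torch.rand(B * V, self.n_tokens, device=x.device).argsort(1)
        keep, drop = ids[:, :n_keep], ids[:, n_keep:]
        vis = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        vis = self.encoder(vis)                      # encode visible patches per view
        # Reassemble full token grids with mask tokens, then decode jointly.
        full = self.mask_token.expand(B * V, self.n_tokens, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand_as(vis), vis)
        full = full + self.pos
        # Inter-frame attention: concatenate all views' tokens so the decoder
        # can attend across frames of the same scene.
        joint = full.reshape(B, V * self.n_tokens, -1)
        out = self.head(self.decoder(joint)).reshape(B * V, self.n_tokens, -1)
        return out, drop        # predict pixels; loss is taken on masked tokens
```

Because masking is uniform across views and the only cross-frame machinery is the small joint decoder, nothing in the sketch depends on the number of views, which is the scalability argument the abstract makes against pairwise cross-view completion.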
Related papers
- Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis [38.10984626023432]
We introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. We benchmark 8 state-of-the-art foundation models and show that DINO-based encoders remain competitive across large viewpoint shifts.
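Probing dense features for multi-view correspondence without finetuning typically reduces to nearest-neighbor matching in feature space. The benchmark's exact protocol is not given here, so the following is a generic sketch assuming a frozen encoder that emits per-patch features:

```python
# Hypothetical correspondence probe for frozen dense features: match each
# patch of view A to its nearest neighbor in view B by cosine similarity.
import torch
import torch.nn.functional as F

def match_patches(feats_a, feats_b):
    """feats_*: (N, D) patch features; returns, for each patch of A,
    the index of its nearest neighbor in B."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    return (a @ b.t()).argmax(dim=-1)

def mutual_matches(feats_a, feats_b):
    """Keep only cyclically consistent matches (A -> B -> back to the same A)."""
    ab = match_patches(feats_a, feats_b)
    ba = match_patches(feats_b, feats_a)
    keep = ba[ab] == torch.arange(len(ab), device=ab.device)
    idx_a = torch.nonzero(keep).squeeze(-1)
    return idx_a, ab[idx_a]   # matched patch indices in A and B

fa, fb = torch.randn(196, 768), torch.randn(196, 768)
ia, ib = mutual_matches(fa, fb)
print(ia.shape == ib.shape)   # True; each pair (ia[k], ib[k]) is a match
```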
arXiv Detail & Related papers (2025-12-12T14:03:16Z)
- Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training [21.0991525279]
We present Muskie, a native multi-view vision backbone designed for 3D vision tasks. Muskie processes multiple views simultaneously and introduces multi-view consistency during the pre-training stage. We demonstrate that using Muskie as a backbone consistently enhances performance on downstream 3D tasks.
arXiv Detail & Related papers (2025-11-22T16:39:59Z)
- Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness [73.72335146374543]
We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks.
arXiv Detail & Related papers (2025-04-02T16:59:55Z)
- Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression [23.65253469577653]
We introduce Alligat0R, a novel pre-training approach that reformulates cross-view learning as a co-visibility segmentation task. Our method predicts whether each pixel in one image is co-visible in the second image, occluded, or outside the field of view (FOV). To support this, we present Cub3, a large-scale dataset with 2.5 million image pairs and dense co-visibility annotations.
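Recast as segmentation, the pretext task is a per-pixel three-way classification of view 1 conditioned on view 2. The head below is a plausible sketch; the cross-attention design and all shapes are assumptions, not taken from the paper:

```python
# Rough sketch of a co-visibility segmentation head: given dense features of
# two views, classify every pixel of view 1 as co-visible, occluded, or
# out-of-FOV. Architecture details are assumed.
import torch
import torch.nn as nn

class CoVisibilityHead(nn.Module):
    NUM_CLASSES = 3  # 0: co-visible in view 2, 1: occluded, 2: outside FOV

    def __init__(self, dim=256):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.classify = nn.Conv2d(dim, self.NUM_CLASSES, kernel_size=1)

    def forward(self, f1, f2):
        # f1, f2: (B, C, H, W) dense features of the two views.
        B, C, H, W = f1.shape
        q = f1.flatten(2).transpose(1, 2)      # (B, HW, C) queries from view 1
        kv = f2.flatten(2).transpose(1, 2)     # keys/values from view 2
        attended, _ = self.cross(q, kv, kv)    # let view-1 pixels look at view 2
        fused = attended.transpose(1, 2).reshape(B, C, H, W)
        return self.classify(fused)            # (B, 3, H, W) class logits

head = CoVisibilityHead()
f1, f2 = torch.randn(2, 256, 28, 28), torch.randn(2, 256, 28, 28)
print(head(f1, f2).shape)   # torch.Size([2, 3, 28, 28])
# Training would use cross-entropy against dense co-visibility labels,
# e.g. from a dataset like Cub3 mentioned above.
```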
arXiv Detail & Related papers (2025-03-10T17:29:48Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
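One way to picture the nesting is as progressively coarser poolings of the full token grid, so a 576-token image can also be read as 144, 36, 9, or 1 token(s). The average pooling below is an assumption for illustration, not necessarily M3's exact construction:

```python
# Toy sketch of Matryoshka-style nested visual tokens: pool a 24x24 grid
# (576 tokens) into progressively coarser grids, 576 -> 144 -> 36 -> 9 -> 1.
import torch
import torch.nn.functional as F

def nested_token_sets(tokens, side=24, levels=(24, 12, 6, 3, 1)):
    """tokens: (B, side*side, D). Returns a list of coarser token sets."""
    B, N, D = tokens.shape
    grid = tokens.transpose(1, 2).reshape(B, D, side, side)
    sets = []
    for s in levels:
        pooled = F.adaptive_avg_pool2d(grid, s)          # (B, D, s, s)
        sets.append(pooled.flatten(2).transpose(1, 2))   # (B, s*s, D)
    return sets

x = torch.randn(2, 576, 768)
for ts in nested_token_sets(x):
    print(ts.shape)   # 576, 144, 36, 9, and 1 visual tokens respectively
```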
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner [19.908670991088556]
We introduce a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features. A novel two-stage self-training strategy is proposed to align 2D and 3D representations. Our method outperforms state-of-the-art counterparts across various downstream tasks, including 3D classification, part segmentation, and object detection.
arXiv Detail & Related papers (2023-11-17T22:10:03Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
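The voting step itself is simple: accumulate each model's per-point class prediction and take the majority. A minimal sketch (how 3D points are projected into each model's 2D masks is abstracted away):

```python
# Sketch of semantic label fusion by voting: several 2D models each give a
# class label per 3D point (via projection into their views, not shown);
# the pseudo label is the majority vote.
import torch
import torch.nn.functional as F

def fuse_labels_by_voting(per_model_labels, num_classes):
    """per_model_labels: (M, N) integer class labels for N 3D points,
    one row per vision model. Returns (N,) majority-vote pseudo labels."""
    votes = F.one_hot(per_model_labels, num_classes).sum(dim=0)  # (N, C) counts
    return votes.argmax(dim=-1)

# Three hypothetical models labelling three points:
labels = torch.tensor([[0, 2, 1],
                       [0, 2, 2],
                       [1, 2, 2]])
print(fuse_labels_by_voting(labels, num_classes=3))  # tensor([0, 2, 2])
```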
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z)
- MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z)
- CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion [20.121597331207276]
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
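Cross-view completion differs from joint multi-view masking in that only the first view is masked while the intact second view conditions the decoder, typically via cross-attention. A schematic decoder block under those assumptions (sizes are illustrative, not CroCo's):

```python
# Schematic CroCo-style decoder block: view-1 tokens (visible + mask tokens)
# self-attend, then cross-attend into the unmasked second view. Sizes assumed.
import torch
import torch.nn as nn

class CrossViewDecoderBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x1, x2):
        # x1: masked view-1 token sequence; x2: full view-2 token sequence.
        h = self.n1(x1)
        x1 = x1 + self.self_attn(h, h, h)[0]
        x1 = x1 + self.cross_attn(self.n2(x1), x2, x2)[0]   # peek at view 2
        return x1 + self.mlp(self.n3(x1))

blk = CrossViewDecoderBlock()
x1, x2 = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(blk(x1, x2).shape)   # torch.Size([2, 196, 256])
```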
arXiv Detail & Related papers (2022-10-19T16:50:36Z)
- From 2D Images to 3D Model: Weakly Supervised Multi-View Face Reconstruction with Deep Fusion [25.068822438649928]
We propose a novel pipeline called Deep Fusion MVR to explore the feature correspondences between multi-view images and reconstruct high-precision 3D faces. Specifically, we present a novel multi-view feature fusion backbone that utilizes face masks to align features from multiple encoders. We develop a concise face mask mechanism that facilitates multi-view feature fusion and facial reconstruction.
arXiv Detail & Related papers (2022-04-08T05:11:04Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
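"Direct" regression here means a fixed set of learned person queries attends to fused multi-view features and each query emits a whole 3D skeleton plus a presence score, with no intermediate 2D detection or triangulation. A bare-bones sketch with all dimensions and counts assumed:

```python
# Bare-bones sketch of query-based direct multi-person 3D pose regression.
# Dimensions, query count, and joint count are assumptions for illustration.
import torch
import torch.nn as nn

class DirectPoseRegressor(nn.Module):
    def __init__(self, dim=256, num_queries=10, num_joints=15):
        super().__init__()
        self.num_joints = num_joints
        self.queries = nn.Embedding(num_queries, dim)      # one query per person slot
        layer = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, 4)
        self.joint_head = nn.Linear(dim, num_joints * 3)   # (x, y, z) per joint
        self.score_head = nn.Linear(dim, 1)                # person-presence logit

    def forward(self, view_feats):
        # view_feats: (B, T, C) feature tokens gathered from all camera views.
        B = view_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(q, view_feats)                    # queries read the scene
        poses = self.joint_head(h).reshape(B, -1, self.num_joints, 3)
        return poses, self.score_head(h).squeeze(-1)

model = DirectPoseRegressor()
feats = torch.randn(2, 4 * 196, 256)   # e.g. 4 views x 196 tokens each
poses, scores = model(feats)
print(poses.shape, scores.shape)       # (2, 10, 15, 3) and (2, 10)
```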
arXiv Detail & Related papers (2021-11-07T13:09:20Z)