Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training
- URL: http://arxiv.org/abs/2511.18115v1
- Date: Sat, 22 Nov 2025 16:39:59 GMT
- Title: Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training
- Authors: Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou, Tongrui Hu
- Abstract summary: We present Muskie, a native multi-view vision backbone designed for 3D vision tasks. Muskie is designed to process multiple views simultaneously and introduces multi-view consistency in the pre-training stage. We demonstrate that using Muskie as a backbone consistently enhances performance on downstream 3D tasks.
- Score: 21.0991525279
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Muskie, a native multi-view vision backbone designed for 3D vision tasks. Unlike existing models, which are frame-wise and exhibit limited multi-view consistency, Muskie is designed to process multiple views simultaneously and introduces multi-view consistency in the pre-training stage. Muskie is trained to reconstruct heavily masked content in one view by finding and utilizing geometric correspondences from other views. Through this pretext task and our proposed aggressive masking strategy, the model implicitly learns view-invariant features and develops strong geometric understanding without any 3D supervision. Compared with state-of-the-art frame-wise backbones such as DINO, Muskie achieves higher multi-view correspondence accuracy. Furthermore, we demonstrate that using Muskie as a backbone consistently enhances performance on downstream 3D tasks, including camera pose estimation and pointmap reconstruction. Code is publicly available at https://leo-frank.github.io/Muskie/
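The pretext task described above (reconstructing a heavily masked view from its unmasked neighbors) can be illustrated with a minimal masking sketch. This is a hypothetical reading of the "aggressive masking strategy", not the paper's actual code: one target view has most of its patch tokens hidden, while the reference views stay fully visible, so the reconstruction objective forces the model to exploit cross-view correspondences.

```python
import random

def multiview_mask(num_views, patches_per_view, mask_ratio=0.9, seed=0):
    """Hypothetical aggressive multi-view masking sketch.

    Picks one target view and hides `mask_ratio` of its patches; all other
    (reference) views remain fully visible, so reconstructing the target
    requires borrowing geometric evidence from the other views.
    """
    rng = random.Random(seed)
    target = rng.randrange(num_views)
    n_masked = int(patches_per_view * mask_ratio)
    hidden = set(rng.sample(range(patches_per_view), n_masked))
    # mask[v][p] is True when patch p of view v is hidden from the encoder
    mask = [
        [v == target and p in hidden for p in range(patches_per_view)]
        for v in range(num_views)
    ]
    return target, mask
```

With a ViT-style tokenization (e.g. 196 patches per view), a 0.9 ratio leaves only ~20 visible patches in the target view, which matches the abstract's emphasis on "heavily masked content".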
Related papers
- Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z) - MuM: Multi-View Masked Image Modeling for 3D Vision [29.044546222577804]
Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. In this work, we focus on learning features tailored for 3D vision. We extend MAE to arbitrarily many views of the same scene and employ a lightweight decoder with inter-frame attention. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation.
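The inter-frame attention mentioned in the MuM abstract can be sketched as follows. This is an assumption about the mechanism (not MuM's actual implementation): patch tokens from all views are concatenated into one sequence so every token can attend across frames, which is what lets the decoder fuse evidence between views.

```python
import numpy as np

def inter_frame_attention(tokens):
    """Minimal single-head attention over tokens from all views jointly.

    tokens: array of shape (views, patches, dim); returns the same shape.
    Flattening views into one sequence means attention weights span
    frame boundaries, i.e. "inter-frame" attention.
    """
    v, p, d = tokens.shape
    x = tokens.reshape(v * p, d)           # all views as one token sequence
    scores = x @ x.T / np.sqrt(d)          # scaled dot-product similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    out = weights @ x                      # attention-weighted mixture
    return out.reshape(v, p, d)
```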
arXiv Detail & Related papers (2025-11-21T15:25:47Z) - Enhancing Monocular 3D Scene Completion with Diffusion Model [20.81599069390756]
3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving. Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance. We introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image.
arXiv Detail & Related papers (2025-03-02T04:36:57Z) - MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model [87.71060849866093]
We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses. We present several training and model modifications to strengthen the model with scaled-up datasets.
arXiv Detail & Related papers (2024-11-25T07:34:23Z) - Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention [54.66152436050373]
We propose a Multi-view Large Reconstruction Model (M-LRM) to reconstruct high-quality 3D shapes from multiple views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images. Compared to previous methods, the proposed M-LRM can generate 3D shapes of high fidelity.
arXiv Detail & Related papers (2024-06-11T18:29:13Z) - MVGamba: Unify 3D Content Generation as State Space Sequence Modeling [150.80564081817786]
We introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only $0.1\times$ of the model size.
arXiv Detail & Related papers (2024-06-10T15:26:48Z) - DiffPoint: Single and Multi-view Point Cloud Reconstruction with ViT Based Diffusion Model [10.253402444122084]
We propose a neat and powerful architecture called DiffPoint that combines ViT and diffusion models for the task of point cloud reconstruction.
We evaluate DiffPoint on both single-view and multi-view reconstruction tasks and achieve state-of-the-art results.
arXiv Detail & Related papers (2024-02-17T10:18:40Z) - Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner [19.908670991088556]
We introduce a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features. A novel two-stage self-training strategy is proposed to align 2D and 3D representations. Our method outperforms state-of-the-art counterparts across various downstream tasks, including 3D classification, part segmentation, and object detection.
arXiv Detail & Related papers (2023-11-17T22:10:03Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z) - From 2D Images to 3D Model: Weakly Supervised Multi-View Face Reconstruction with Deep Fusion [25.068822438649928]
We propose a novel pipeline called Deep Fusion MVR to explore the feature correspondences between multi-view images and reconstruct high-precision 3D faces. Specifically, we present a novel multi-view feature fusion backbone that utilizes face masks to align features from multiple encoders. We develop one concise face mask mechanism that facilitates multi-view feature fusion and facial reconstruction.
arXiv Detail & Related papers (2022-04-08T05:11:04Z) - Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z) - Multi-View Matching (MVM): Facilitating Multi-Person 3D Pose Estimation Learning with Action-Frozen People Video [38.63662549684785]
The MVM method generates reliable 3D human poses from a large-scale video dataset.
We train a neural network that takes a single image as the input for multi-person 3D pose estimation.
arXiv Detail & Related papers (2020-04-11T01:09:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.