M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding
- URL: http://arxiv.org/abs/2309.15313v1
- Date: Tue, 26 Sep 2023 23:52:09 GMT
- Title: M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding
- Authors: Muhammad Abdullah Jamal, Omid Mohareri
- Abstract summary: We present M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$), built on multi-modal masked autoencoders.
We integrate two major self-supervised learning frameworks: Masked Image Modeling (MIM) and contrastive learning.
Experiments show that M$^{3}$3D outperforms the existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR.
- Score: 5.989397492717352
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a new pre-training strategy called M$^{3}$3D
($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$),
built on multi-modal masked autoencoders that leverage 3D priors and learned
cross-modal representations in RGB-D data. We integrate two major
self-supervised learning frameworks, Masked Image Modeling (MIM) and
contrastive learning, aiming to effectively embed masked 3D priors and
modality-complementary features and to enhance the correspondence between
modalities. In contrast to recent approaches that either focus on specific
downstream tasks or require multi-view correspondence, we show that our
pre-training strategy is general-purpose, enabling improved representation
learning that transfers to better performance on various downstream tasks such
as video action recognition, video action detection, 2D semantic segmentation
and depth estimation. Experiments show that M$^{3}$3D outperforms existing
state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR, notably with
an improvement of +1.3\% mIoU over Mask3D on ScanNet semantic segmentation. We
further evaluate our method in a low-data regime and demonstrate its superior
data efficiency compared to current state-of-the-art approaches.
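The abstract combines two pre-training signals: masked reconstruction (MIM) over RGB-D patches and a cross-modal contrastive objective between the RGB and depth streams. The minimal PyTorch-style sketch below illustrates how such a joint objective is commonly assembled; the function names, tensor shapes, masking ratio, and temperature are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: masked-patch reconstruction loss plus a symmetric
# cross-modal InfoNCE loss between RGB and depth embeddings.
import torch
import torch.nn.functional as F


def mim_loss(pred_patches, target_patches, mask):
    """Mean squared error computed only on the masked patches.

    pred_patches, target_patches: (B, N, D) predicted / target patch values
    mask: (B, N) bool tensor, True where a patch was masked out
    """
    per_patch = (pred_patches - target_patches).pow(2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


def cross_modal_infonce(rgb_emb, depth_emb, temperature=0.07):
    """Symmetric InfoNCE between per-sample RGB and depth embeddings (B, C)."""
    rgb = F.normalize(rgb_emb, dim=-1)
    depth = F.normalize(depth_emb, dim=-1)
    logits = rgb @ depth.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(rgb.size(0), device=rgb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage with random tensors standing in for encoder/decoder outputs.
B, N, D, C = 4, 196, 768, 256
pred = torch.randn(B, N, D)
target = torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.75                      # e.g. 75% masking ratio
rgb_emb, depth_emb = torch.randn(B, C), torch.randn(B, C)

total_loss = mim_loss(pred, target, mask) + cross_modal_infonce(rgb_emb, depth_emb)
print(total_loss.item())
```

In practice the reconstruction targets and embeddings would come from the modality encoders and decoder of the masked autoencoder, and the two terms would be weighted against each other as hyperparameters.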
Related papers
- Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval [5.965791109321719]
Cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems.
We propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data.
arXiv Detail & Related papers (2024-08-11T07:03:21Z)
- A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders [5.069884983892437]
We propose a new progressive pre-training method for image understanding tasks which leverages RGB-D datasets.
In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations.
In the second stage, we further pre-train the model using masked autoencoding and denoising/noise prediction.
Our approach is scalable, robust, and suitable for pre-training on RGB-D datasets.
arXiv Detail & Related papers (2024-08-05T05:33:59Z)
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
This insufficient synergy overlooks the fact that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the heavy data requirements of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection [26.03582038710992]
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities.
In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world.
We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
arXiv Detail & Related papers (2023-03-14T17:58:03Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders [52.91248611338202]
We propose an alternative that obtains superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE.
By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transfer capability.
arXiv Detail & Related papers (2022-12-13T17:59:20Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.