MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding
- URL: http://arxiv.org/abs/2405.18523v2
- Date: Mon, 19 Aug 2024 08:26:25 GMT
- Title: MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding
- Authors: Jiaze Wang, Yi Wang, Ziyu Guo, Renrui Zhang, Donghao Zhou, Guangyong Chen, Anfeng Liu, Pheng-Ann Heng,
- Abstract summary: We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding.
Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder.
We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios.
- Score: 64.65145700121442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding. MM-Mixing applies mixing-based methods to multi-modal data, preserving and optimizing cross-modal connections while enhancing diversity and improving alignment across modalities. Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder. The first stage employs feature-level mixing with contrastive learning to align 3D features with their corresponding modalities. The second stage incorporates both feature-level and input-level mixing, introducing mixed point cloud inputs to further refine 3D feature representations. MM-Mixing enhances intermodality relationships, promotes generalization, and ensures feature consistency while providing diverse and realistic training samples. We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios, including zero-shot 3D classification, linear probing 3D classification, and cross-modal 3D shape retrieval. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3% to 61.9%, and on Objaverse-LVIS from 46.8% to 51.4%. Our findings highlight the potential of multi-modal mixing-based alignment to significantly advance 3D object recognition and understanding while remaining straightforward to implement and integrate into existing frameworks.
Related papers
- Adaptive Mix for Semi-Supervised Medical Image Segmentation [22.69909762038458]
We propose an Adaptive Mix algorithm (AdaMix) for image mix-up in a self-paced learning manner.
We develop three frameworks with our AdaMix, i.e., AdaMix-ST, AdaMix-MT, and AdaMix-CT, for semi-supervised medical image segmentation.
arXiv Detail & Related papers (2024-07-31T13:19:39Z) - Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones.
MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z) - Multiway Point Cloud Mosaicking with Diffusion and Global Optimization [74.3802812773891]
We introduce a novel framework for multiway point cloud mosaicking (named Wednesday)
At the core of our approach is ODIN, a learned pairwise registration algorithm that identifies overlaps and refines attention scores.
Tested on four diverse, large-scale datasets, our method state-of-the-art pairwise and rotation registration results by a large margin on all benchmarks.
arXiv Detail & Related papers (2024-03-30T17:29:13Z) - PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z) - Connecting Multi-modal Contrastive Representations [50.26161419616139]
Multi-modal Contrastive Representation learning aims to encode different modalities into a semantically shared space.
This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR)
C-MCR achieves audio-visual state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks.
arXiv Detail & Related papers (2023-05-22T09:44:39Z) - MixupE: Understanding and Improving Mixup from Directional Derivative
Perspective [86.06981860668424]
We propose an improved version of Mixup, theoretically justified to deliver better generalization performance than the vanilla Mixup.
Our results show that the proposed method improves Mixup across multiple datasets using a variety of architectures.
arXiv Detail & Related papers (2022-12-27T07:03:52Z) - SageMix: Saliency-Guided Mixup for Point Clouds [14.94694648742664]
We propose SageMix, a saliency-guided Mixup for point clouds to preserve salient local structures.
With PointNet++, our method achieves an accuracy gain of 2.6% and 4.0% over standard training in 3D Warehouse dataset (MN40) and ScanObjectNN, respectively.
arXiv Detail & Related papers (2022-10-13T12:19:58Z) - Pose Adaptive Dual Mixup for Few-Shot Single-View 3D Reconstruction [35.30827580375749]
We present a pose adaptive few-shot learning procedure and a two-stage data regularization, termed PADMix, for single-image 3D reconstruction.
PADMix significantly outperforms previous literature on few-shot settings over the ShapeNet dataset and sets new benchmarks on the more challenging real-world Pix3D dataset.
arXiv Detail & Related papers (2021-12-23T12:22:08Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.