Related papers: TripletMix: Triplet Data Augmentation for 3D Understanding

URL: http://arxiv.org/abs/2405.18523v1
Date: Tue, 28 May 2024 18:44:15 GMT
Title: TripletMix: Triplet Data Augmentation for 3D Understanding
Authors: Jiaze Wang, Yi Wang, Ziyu Guo, Renrui Zhang, Donghao Zhou, Guangyong Chen, Anfeng Liu, Pheng-Ann Heng
Abstract summary: TripletMix is a novel approach to address the previously unexplored issue of multimodal data augmentation in 3D understanding. Our findings highlight the potential of multimodal data augmentation to significantly advance 3D object recognition and understanding.
Score: 64.65145700121442
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data augmentation has proven to be a vital tool for enhancing the generalization capabilities of deep learning models, especially in the context of 3D vision where traditional datasets are often limited. Despite previous advancements, existing methods primarily cater to unimodal data scenarios, leaving a gap in the augmentation of multimodal triplet data, which integrates text, images, and point clouds. Simultaneously augmenting all three modalities enhances diversity and improves alignment across modalities, resulting in more comprehensive and robust 3D representations. To address this gap, we propose TripletMix, a novel approach to address the previously unexplored issue of multimodal data augmentation in 3D understanding. TripletMix innovatively applies the principles of mixed-based augmentation to multimodal triplet data, allowing for the preservation and optimization of cross-modal connections. Our proposed TripletMix combines feature-level and input-level augmentations to achieve dual enhancement between raw data and latent features, significantly improving the model's cross-modal understanding and generalization capabilities by ensuring feature consistency and providing diverse and realistic training samples. We demonstrate that TripletMix not only improves the baseline performance of models in various learning scenarios including zero-shot and linear probing classification but also significantly enhances model generalizability. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3 percent to 61.9 percent, and on Objaverse-LVIS from 46.8 percent to 51.4 percent. Our findings highlight the potential of multimodal data augmentation to significantly advance 3D object recognition and understanding.
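
Illustrative sketch (not the authors' implementation): the abstract describes applying mix-based augmentation to text, image, and point-cloud triplets while preserving cross-modal connections. One plausible reading of the feature-level part is to interpolate the three modality features of two samples with a single shared Mixup coefficient. The helper names `mix_vectors` and `mix_triplets`, the Beta(alpha, alpha) sampling, and the toy feature values below are all assumptions made for illustration; the paper's input-level augmentation is not shown.

```python
import random


def mix_vectors(a, b, lam):
    """Linearly interpolate two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mix_triplets(triplet_a, triplet_b, alpha=1.0, rng=random):
    """Mix two (text, image, point_cloud) feature triplets with one shared lambda.

    Hypothetical helper: sharing lambda across modalities is an assumption
    intended to keep the mixed text, image, and point-cloud features aligned.
    """
    lam = rng.betavariate(alpha, alpha)  # Mixup-style coefficient in [0, 1]
    mixed = tuple(
        mix_vectors(feat_a, feat_b, lam)
        for feat_a, feat_b in zip(triplet_a, triplet_b)
    )
    return mixed, lam


# Toy features for two objects (values are made up for illustration).
chair = ([1.0, 0.0], [0.8, 0.2], [0.9, 0.1])
table = ([0.0, 1.0], [0.1, 0.9], [0.2, 0.8])

# A seeded generator keeps the example reproducible.
mixed, lam = mix_triplets(chair, table, alpha=1.0, rng=random.Random(0))
```

Because the same `lam` is used for all three modalities, each mixed triplet still describes one consistent blend of the two source objects, which is the property the abstract refers to as preserving cross-modal connections.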

Related papers

Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting [51.08718483081347]
We propose a framework that couples multi-view joint distribution priors to ensure geometrically consistent 3D generation. We derive an effective optimization rule that effectively couples multi-view priors to guide optimization across different viewpoints. We employ a deformable tetrahedral grid, from 3D-GS and refined through CSD, to produce high-quality, refined meshes.
arXiv Detail & Related papers (2025-05-07T09:12:45Z)
Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation [15.47711837051754]
Model merging aims to integrate multiple task-specific models into a unified model that inherits the capabilities of the task-specific models. Existing model merging methods often lack consideration of the varying contribution ratios of different task-specific models to the final merged model. We propose Mixup Model Merge (M3), a simple yet effective method inspired by the randomized linear strategy from the Mixup data augmentation technique.
arXiv Detail & Related papers (2025-02-21T13:01:26Z)
Adaptive Mix for Semi-Supervised Medical Image Segmentation [22.69909762038458]
We propose an Adaptive Mix algorithm (AdaMix) for image mix-up in a self-paced learning manner. We develop three frameworks with our AdaMix, i.e., AdaMix-ST, AdaMix-MT, and AdaMix-CT, for semi-supervised medical image segmentation.
arXiv Detail & Related papers (2024-07-31T13:19:39Z)
Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones. MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z)
Multiway Point Cloud Mosaicking with Diffusion and Global Optimization [74.3802812773891]
We introduce a novel framework for multiway point cloud mosaicking (named Wednesday). At the core of our approach is ODIN, a learned pairwise registration algorithm that identifies overlaps and refines attention scores. Tested on four diverse, large-scale datasets, our method achieves state-of-the-art pairwise and rotation registration results by a large margin on all benchmarks.
arXiv Detail & Related papers (2024-03-30T17:29:13Z)
TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters. TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z)
PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches. PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
Connecting Multi-modal Contrastive Representations [50.26161419616139]
Multi-modal Contrastive Representation learning aims to encode different modalities into a semantically shared space. This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR). C-MCR achieves audio-visual state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks.
arXiv Detail & Related papers (2023-05-22T09:44:39Z)
MixupE: Understanding and Improving Mixup from Directional Derivative Perspective [86.06981860668424]
We propose an improved version of Mixup, theoretically justified to deliver better generalization performance than the vanilla Mixup. Our results show that the proposed method improves Mixup across multiple datasets using a variety of architectures.
arXiv Detail & Related papers (2022-12-27T07:03:52Z)
SageMix: Saliency-Guided Mixup for Point Clouds [14.94694648742664]
We propose SageMix, a saliency-guided Mixup for point clouds to preserve salient local structures. With PointNet++, our method achieves an accuracy gain of 2.6% and 4.0% over standard training in 3D Warehouse dataset (MN40) and ScanObjectNN, respectively.
arXiv Detail & Related papers (2022-10-13T12:19:58Z)
Pose Adaptive Dual Mixup for Few-Shot Single-View 3D Reconstruction [35.30827580375749]
We present a pose adaptive few-shot learning procedure and a two-stage data regularization, termed PADMix, for single-image 3D reconstruction. PADMix significantly outperforms previous literature on few-shot settings over the ShapeNet dataset and sets new benchmarks on the more challenging real-world Pix3D dataset.
arXiv Detail & Related papers (2021-12-23T12:22:08Z)
Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels. In this paper, we explore how to apply mixup to natural language processing tasks. We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.