Learning Multimodal Data Augmentation in Feature Space
- URL: http://arxiv.org/abs/2212.14453v2
- Date: Mon, 24 Apr 2023 14:48:00 GMT
- Title: Learning Multimodal Data Augmentation in Feature Space
- Authors: Zichang Liu, Zhiqiang Tang, Xingjian Shi, Aston Zhang, Mu Li,
Anshumali Shrivastava, Andrew Gordon Wilson
- Abstract summary: LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
- Score: 65.54623807628536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to jointly learn from multiple modalities, such as text, audio,
and visual data, is a defining feature of intelligent systems. While there have
been promising advances in designing neural networks to harness multimodal
data, the enormous success of data augmentation currently remains limited to
single-modality tasks like image classification. Indeed, it is particularly
difficult to augment each modality while preserving the overall semantic
structure of the data; for example, a caption may no longer be a good
description of an image after standard augmentations have been applied, such as
translation. Moreover, it is challenging to specify reasonable transformations
that are not tailored to a particular modality. In this paper, we introduce
LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that
automatically learns to jointly augment multimodal data in feature space, with
no constraints on the identities of the modalities or the relationship between
modalities. We show that LeMDA can (1) profoundly improve the performance of
multimodal deep learning architectures, (2) apply to combinations of modalities
that have not been previously considered, and (3) achieve state-of-the-art
results on a wide range of applications comprised of image, text, and tabular
data.
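As a rough, hedged sketch of the idea (not the authors' exact implementation), the code below augments multimodal data in feature space: per-modality encoders produce latent features, a small learnable augmentation network perturbs them, and the augmentation network is trained against the task network. The encoder sizes, the additive MLP augmenter, the loss weighting, and the adversarial-plus-consistency objective are all illustrative assumptions.

```python
# Minimal sketch of learning data augmentation in feature space for two modalities.
# Everything here (encoder sizes, the additive augmentation MLP, the adversarial +
# consistency objective) is an illustrative assumption, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAugmenter(nn.Module):
    """Proposes additive perturbations to the concatenated modality features."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats + self.net(feats)  # augmented features stay in the same space

# Hypothetical stand-ins for per-modality backbones and a shared task head.
image_enc, text_enc = nn.Linear(512, 128), nn.Linear(300, 128)
task_head = nn.Linear(256, 10)
augmenter = FeatureAugmenter(feat_dim=256)

opt_task = torch.optim.Adam(
    list(image_enc.parameters()) + list(text_enc.parameters())
    + list(task_head.parameters()), lr=1e-3)
opt_aug = torch.optim.Adam(augmenter.parameters(), lr=1e-3)

def training_step(img, txt, labels):
    # 1) Train the task network on both original and augmented features.
    feats = torch.cat([image_enc(img), text_enc(txt)], dim=-1)
    aug_feats = augmenter(feats)
    loss_task = (F.cross_entropy(task_head(feats), labels)
                 + F.cross_entropy(task_head(aug_feats.detach()), labels))
    opt_task.zero_grad(); loss_task.backward(); opt_task.step()

    # 2) Train the augmenter to make examples harder while keeping its predictions
    #    consistent with the unaugmented ones (a simplified adversarial objective).
    with torch.no_grad():
        base_feats = torch.cat([image_enc(img), text_enc(txt)], dim=-1)
        target_probs = F.softmax(task_head(base_feats), dim=-1)
    logits_aug = task_head(augmenter(base_feats))
    loss_aug = (-F.cross_entropy(logits_aug, labels)
                + F.kl_div(F.log_softmax(logits_aug, dim=-1), target_probs,
                           reduction="batchmean"))
    opt_aug.zero_grad(); loss_aug.backward(); opt_aug.step()
    return loss_task.item(), loss_aug.item()

# Example usage with random data.
img, txt = torch.randn(8, 512), torch.randn(8, 300)
labels = torch.randint(0, 10, (8,))
print(training_step(img, txt, labels))
```

Augmenting latent features rather than raw inputs means no modality-specific transform (crop, synonym swap, pitch shift) has to be specified, and the learned augmentation can act on all modalities jointly, which is the property the abstract emphasizes.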
Related papers
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance on nine tasks while using significantly less data than existing SOTA models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities [17.723207830420996]
Multimodal learning methods often exhibit deteriorated performance when one or more modalities are missing.
We propose a robust textual-visual multimodal learning method, Chameleon, that completely deviates from the conventional multi-branch design.
Experiments are performed on four popular datasets including Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta.
arXiv Detail & Related papers (2024-07-23T07:29:57Z)
- Semantic-Aware Representation of Multi-Modal Data for Data Ingress: A Literature Review [1.8590097948961688]
Generative AI, such as Large Language Models (LLMs), is seeing broad adoption for processing multi-modal data such as text, images, audio, and video.
Managing this data efficiently has become a significant practical challenge in industry: twice as much data is not twice as good.
This study focuses on the different semantic-aware techniques to extract embeddings from mono-modal, multi-modal, and cross-modal data.
arXiv Detail & Related papers (2024-07-17T09:49:11Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose NativE, a comprehensive framework to achieve multi-modal knowledge graph completion (MMKGC) in the wild.
NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing? [37.73329106465031]
We propose GTI-MM, a text-to-image framework that enhances data efficiency and model robustness when the visual modality is missing.
Our findings reveal that synthetic images improve training data efficiency when visual data are missing during training, and improve model robustness when visual data are missing during both training and testing.
arXiv Detail & Related papers (2024-02-14T09:21:00Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction, and utilizes missing-modality data augmentation techniques (see the sketch after this list) to better adapt to situations with missing modalities.
arXiv Detail & Related papers (2023-10-10T07:47:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture [19.927662512903915]
Multimodal multitask learning has attracted increasing interest in recent years.
Many methods have been proposed to learn from a specific type of multimodal data, such as vision and language data.
We extend and improve Omninet, an architecture that is capable of handling multiple modalities and tasks at a time.
arXiv Detail & Related papers (2023-07-01T05:02:46Z)
- Factorized Contrastive Learning: Going Beyond Multi-view Redundancy [116.25342513407173]
This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy.
On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-06-08T15:17:04Z)
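For the missing-modality data augmentation referenced in the UME-MMA entry above, one common technique is modality dropout: entire modalities are randomly zeroed per sample during training so the fusion model learns to tolerate absent inputs at test time. The sketch below is a generic illustration under assumed names and a simple zero-filling choice, not any specific paper's implementation.

```python
# Minimal sketch of modality dropout as a missing-modality data augmentation.
# The function name, drop probability, and zero-filling are illustrative assumptions;
# variants use learned placeholder vectors or modality-aware schedules instead.
import torch

def modality_dropout(features: dict, p_drop: float = 0.3, training: bool = True) -> dict:
    """Randomly zero out whole modalities per sample so downstream fusion
    learns to predict when some modalities are missing at test time."""
    if not training:
        return features
    batch = next(iter(features.values())).shape[0]
    out = {}
    for name, feats in features.items():
        keep = (torch.rand(batch, device=feats.device) > p_drop).float()
        # Broadcast the per-sample keep mask over the remaining feature dimensions.
        out[name] = feats * keep.view(batch, *([1] * (feats.dim() - 1)))
    return out

# Usage: apply to per-modality features (or embeddings) before fusion.
feats = {"image": torch.randn(8, 128), "text": torch.randn(8, 128)}
augmented = modality_dropout(feats, p_drop=0.3)
```

Zero-filling is the simplest placeholder; pairing it with uni-modal pre-trained encoders, as the UME-MMA summary suggests, helps keep each branch's features informative even when the other modality is dropped.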
This list is automatically generated from the titles and abstracts of the papers on this site.