Can Text-to-image Model Assist Multi-modal Learning for Visual
Recognition with Visual Modality Missing?
- URL: http://arxiv.org/abs/2402.09036v1
- Date: Wed, 14 Feb 2024 09:21:00 GMT
- Title: Can Text-to-image Model Assist Multi-modal Learning for Visual
Recognition with Visual Modality Missing?
- Authors: Tiantian Feng and Daniel Yang and Digbalay Bose and Shrikanth
Narayanan
- Abstract summary: We propose a text-to-image framework GTI-MM to enhance the data efficiency and model robustness against missing visual modality.
Our findings reveal that synthetic images improve training data efficiency when visual data are missing during training, and improve model robustness when visual data are missing during both training and testing.
- Score: 37.73329106465031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal learning has emerged as an increasingly promising avenue in
vision recognition, driving innovations across diverse domains ranging from
media and education to healthcare and transportation. Despite its success, the
robustness of multi-modal learning for visual recognition is often challenged
by the unavailability of a subset of modalities, especially the visual
modality. Conventional approaches to mitigate missing modalities in multi-modal
learning rely heavily on algorithms and modality fusion schemes. In contrast,
this paper explores the use of text-to-image models to assist multi-modal
learning. Specifically, we propose a simple but effective multi-modal learning
framework GTI-MM to enhance the data efficiency and model robustness against
missing visual modality by imputing the missing data with generative
transformers. Using multiple multi-modal datasets with visual recognition
tasks, we present a comprehensive analysis of diverse conditions involving
missing visual modality in data, including during model training. Our findings
reveal that synthetic images improve training data efficiency when visual data
are missing during training, and improve model robustness when visual data are
missing during both training and testing. Moreover, we demonstrate that GTI-MM
is effective with a lower generation quantity and simple prompt techniques.
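The core imputation step is easy to prototype. Below is a minimal sketch, not the authors' released code: for samples whose visual modality is missing, a substitute image is generated from the available text with an off-the-shelf text-to-image model. A Hugging Face diffusers Stable Diffusion checkpoint stands in for the paper's generative model, and the dataset fields and prompt template are hypothetical.

```python
# Minimal sketch: impute missing images from text with a text-to-image model.
# Assumptions: `diffusers` is installed, a CUDA GPU is available, and each
# sample is a dict with a "text" field and an optional "image" field.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # stand-in checkpoint (assumption)
    torch_dtype=torch.float16,
).to("cuda")

def impute_missing_images(samples, num_images=1):
    """Generate an image for every sample whose visual modality is missing."""
    for sample in samples:
        if sample.get("image") is None:
            # Simple prompt template; the abstract reports that simple prompts suffice.
            prompt = f"a photo depicting: {sample['text']}"
            sample["image"] = pipe(prompt, num_images_per_prompt=num_images).images[0]
    return samples
```

Per the abstract, a lower generation quantity and simple prompt techniques are already effective, so a single generated image per missing sample, as above, is a reasonable starting point.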
Related papers
- Multimodal Representation Learning using Adaptive Graph Construction [0.5221459608786241]
Multimodal contrastive learning trains neural networks by leveraging data from heterogeneous sources such as images and text.
We propose AutoBIND, a novel contrastive learning framework that can learn representations from an arbitrary number of modalities.
We show that AutoBIND outperforms previous methods on this task, highlighting the generalizability of the approach.
arXiv Detail & Related papers (2024-10-08T21:57:46Z)
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z)
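As a rough illustration of the missing-type prompt idea from the entry above (a sketch under my own assumptions, not the authors' implementation): learnable prompt vectors indexed by which modality is absent are prepended to the token sequence of an otherwise frozen multimodal Transformer, so only the prompts and a task head need training. All names and sizes below are hypothetical.

```python
# Sketch of missing-type prompts: learnable prompt tokens selected by which
# modality is missing (0 = none, 1 = image missing, 2 = text missing) are
# prepended to the token sequence fed to a frozen multimodal Transformer.
import torch
import torch.nn as nn

class MissingTypePrompts(nn.Module):
    def __init__(self, dim=768, prompt_len=4, num_missing_types=3):
        super().__init__()
        self.prompts = nn.Parameter(
            torch.randn(num_missing_types, prompt_len, dim) * 0.02
        )

    def forward(self, tokens, missing_type):
        # tokens: (batch, seq_len, dim); missing_type: (batch,) long tensor
        prompt = self.prompts[missing_type]        # (batch, prompt_len, dim)
        return torch.cat([prompt, tokens], dim=1)  # prepend prompts to the sequence
```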
- Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data.
Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks.
We release our pre-trained models as well as source code publicly.
arXiv Detail & Related papers (2024-04-16T20:51:36Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
arXiv Detail & Related papers (2023-10-10T07:47:57Z)
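The missing-modality data augmentation component of UME-MMA can be approximated by randomly masking one modality's features during training. The sketch below makes that assumption; the feature shapes, drop rate, and function name are hypothetical rather than the authors' exact recipe.

```python
# Sketch of missing-modality augmentation: during training, randomly replace one
# modality's features with zeros so the model learns to cope with its absence.
# Assumes per-sample feature tensors of shape (batch, dim).
import torch

def drop_modality(image_feats, text_feats, p_missing=0.3):
    """Randomly zero out the image or text features for a fraction of the batch."""
    batch = image_feats.size(0)
    drop_image = torch.rand(batch, device=image_feats.device) < p_missing
    # Avoid dropping both modalities for the same sample.
    drop_text = (torch.rand(batch, device=text_feats.device) < p_missing) & ~drop_image
    image_feats = image_feats * (~drop_image).float().unsqueeze(-1)
    text_feats = text_feats * (~drop_text).float().unsqueeze(-1)
    return image_feats, text_feats
```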
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
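A rough sketch of the feature-space augmentation idea behind the LeMDA entry above, under simplifying assumptions: a small learned MLP perturbs the fused multimodal features, and a prediction-consistency term keeps the task network's outputs on original and augmented features close. The actual training objective and architecture differ; all names here are hypothetical.

```python
# Sketch of learned feature-space augmentation for multimodal features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAugmenter(nn.Module):
    """Learned residual perturbation applied to fused multimodal features."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats):
        return feats + self.net(feats)

def consistency_loss(classifier, feats, aug_feats):
    """KL divergence between predictions on original and augmented features."""
    p = F.log_softmax(classifier(aug_feats), dim=-1)
    q = F.softmax(classifier(feats), dim=-1).detach()
    return F.kl_div(p, q, reduction="batchmean")
```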
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.