Can Text-to-image Model Assist Multi-modal Learning for Visual
  Recognition with Visual Modality Missing?
- URL: http://arxiv.org/abs/2402.09036v1
- Date: Wed, 14 Feb 2024 09:21:00 GMT
- Title: Can Text-to-image Model Assist Multi-modal Learning for Visual
  Recognition with Visual Modality Missing?
- Authors: Tiantian Feng and Daniel Yang and Digbalay Bose and Shrikanth
  Narayanan
- Abstract summary: We propose a text-to-image framework GTI-MM to enhance the data efficiency and model robustness against missing visual modality.
Our findings reveal that synthetic images benefit training data efficiency with visual data missing in training and improve model robustness with visual data missing involving training and testing.
- Score: 37.73329106465031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Multi-modal learning has emerged as an increasingly promising avenue in
vision recognition, driving innovations across diverse domains ranging from
media and education to healthcare and transportation. Despite its success, the
robustness of multi-modal learning for visual recognition is often challenged
by the unavailability of a subset of modalities, especially the visual
modality. Conventional approaches to mitigate missing modalities in multi-modal
learning rely heavily on algorithms and modality fusion schemes. In contrast,
this paper explores the use of text-to-image models to assist multi-modal
learning. Specifically, we propose a simple but effective multi-modal learning
framework GTI-MM to enhance the data efficiency and model robustness against
missing visual modality by imputing the missing data with generative
transformers. Using multiple multi-modal datasets with visual recognition
tasks, we present a comprehensive analysis of diverse conditions involving
missing visual modality in data, including model training. Our findings reveal
that synthetic images benefit training data efficiency with visual data missing
in training and improve model robustness with visual data missing involving
training and testing. Moreover, we demonstrate GTI-MM is effective with lower
generation quantity and simple prompt techniques.
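
The imputation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' GTI-MM implementation: the `Sample` type, the `build_prompt` template, and `stub_generator` are hypothetical names introduced only for illustration, and the stub stands in for a real text-to-image model so that the example runs without extra dependencies. The prompt here uses the class label, which assumes labels are available (as in training); a test-time variant would have to rely on the text alone.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional

# Hypothetical stand-in: an "image" is a list of floats so the sketch
# needs no deep-learning library.
Image = List[float]


@dataclass(frozen=True)
class Sample:
    text: str                       # accompanying text modality
    label: str                      # class label (assumed known here)
    image: Optional[Image] = None   # None marks a missing visual modality
    synthetic: bool = False         # True if the image was generated


def build_prompt(sample: Sample) -> str:
    # Assumed simple prompt template; the paper's exact prompts are not
    # given in the abstract.
    return f"a photo of {sample.label}: {sample.text}"


def impute_missing_images(
    samples: List[Sample],
    generate_image: Callable[[str], Image],
) -> List[Sample]:
    """Fill in missing images with generated ones; keep real images as-is."""
    completed = []
    for s in samples:
        if s.image is None:
            s = replace(s, image=generate_image(build_prompt(s)), synthetic=True)
        completed.append(s)
    return completed


def stub_generator(prompt: str) -> Image:
    # Placeholder for a real text-to-image model (assumption).
    return [float(len(prompt))]


data = [
    Sample("kids kicking a ball on grass", "soccer", image=[0.1, 0.2]),
    Sample("a chef slicing onions", "cooking"),  # visual modality missing
]
completed = impute_missing_images(data, stub_generator)
```

After this step every sample carries an image, so a standard multi-modal model could be trained on the completed set; the `synthetic` flag keeps real and generated images distinguishable for later analysis.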
 
      
Related papers
- True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
 Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL): adapting to new tasks.
Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation.
We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
 arXiv  Detail & Related papers  (2025-07-21T17:08:18Z)
- Quantifying Cross-Modality Memorization in Vision-Language Models [86.82366725590508]
 We study the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models.
Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities.
 arXiv  Detail & Related papers  (2025-06-05T16:10:47Z)
- Multimodal Representation Learning using Adaptive Graph Construction [0.5221459608786241]
 Multimodal contrastive learning trains neural networks by leveraging data from heterogeneous sources such as images and text.
We propose AutoBIND, a novel contrastive learning framework that can learn representations from an arbitrary number of modalities.
We show that AutoBIND outperforms previous methods on this task, highlighting the generalizability of the approach.
 arXiv  Detail & Related papers  (2024-10-08T21:57:46Z)
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
 We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
 arXiv  Detail & Related papers  (2024-07-07T13:55:56Z)
- Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
 We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data.
Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks.
We release our pre-trained models as well as source code publicly.
 arXiv  Detail & Related papers  (2024-04-16T20:51:36Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
 We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
 arXiv  Detail & Related papers  (2024-02-05T12:47:09Z)
- What Makes for Robust Multi-Modal Models in the Face of Missing
  Modalities? [35.19295402483624]
 We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
 arXiv  Detail & Related papers  (2023-10-10T07:47:57Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
  Image-Dialogue Data [129.92449761766025]
 We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
 arXiv  Detail & Related papers  (2023-08-20T12:43:52Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
 LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
 arXiv  Detail & Related papers  (2022-12-29T20:39:36Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
  Skip-connections [104.14624185375897]
 mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
 arXiv  Detail & Related papers  (2022-05-24T11:52:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     