Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting
- URL: http://arxiv.org/abs/2408.09798v1
- Date: Mon, 19 Aug 2024 08:44:55 GMT
- Title: Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting
- Authors: Yun-Da Tsai, Ting-Yu Yen, Keng-Te Liao, Shou-De Lin
- Abstract summary: This study evaluates the quality and robustness of multimodal representations in the face of noise imperfections, dynamic input order permutations, and missing modalities.
We propose a new text-centric adversarial training approach that significantly enhances robustness compared to traditional robust training methods and pre-trained multimodal foundation models.
- Score: 4.985886792128721
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Converting different modalities into generalized text, which then serves as input prompts for large language models (LLMs), is a common approach for aligning multimodal models, particularly when pairwise data is limited. Text-centric alignment methods leverage the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation, thereby enabling downstream models to effectively interpret various modal inputs. This study evaluates the quality and robustness of multimodal representations in the face of noise imperfections, dynamic input order permutations, and missing modalities, revealing that current text-centric alignment methods can compromise downstream robustness. To address this issue, we propose a new text-centric adversarial training approach that significantly enhances robustness compared to traditional robust training methods and pre-trained multimodal foundation models. Our findings underscore the potential of this approach to improve the robustness and adaptability of multimodal representations, offering a promising solution for dynamic and real-world applications.
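The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation: each modality is serialized into text, the resulting prompt is perturbed in ways that mimic the noise, missing-modality, and order-permutation conditions studied in the paper, and the downstream model is trained on the perturbation it finds hardest. The serializer, the perturbation set, and the `model.loss` / `model.update` interface are all hypothetical stand-ins.

```python
import random

# Illustrative sketch of text-centric adversarial training.
# All functions and the model interface below are assumptions,
# not the method described in the paper.

def serialize_modalities(sample: dict) -> str:
    """Convert each modality into text and join them into one prompt."""
    parts = []
    if "image_caption" in sample:          # e.g. output of a captioning model
        parts.append(f"Image: {sample['image_caption']}")
    if "tabular" in sample:                # e.g. linearized key-value pairs
        kv = ", ".join(f"{k}={v}" for k, v in sample["tabular"].items())
        parts.append(f"Table: {kv}")
    if "audio_transcript" in sample:
        parts.append(f"Audio: {sample['audio_transcript']}")
    return " | ".join(parts)

def perturb_prompt(prompt: str, drop_prob: float = 0.15, shuffle: bool = True) -> str:
    """Simulate noise conditions in text space: missing modalities and order permutations."""
    segments = prompt.split(" | ")
    segments = [s for s in segments if random.random() > drop_prob]  # drop a modality
    if shuffle:
        random.shuffle(segments)                                     # permute input order
    return " | ".join(segments)

def adversarial_training_step(model, sample: dict, label, n_candidates: int = 4):
    """Pick the perturbed prompt the model finds hardest, then train on it."""
    clean = serialize_modalities(sample)
    candidates = [clean] + [perturb_prompt(clean) for _ in range(n_candidates)]
    worst = max(candidates, key=lambda p: model.loss(p, label))  # adversarial selection
    model.update(worst, label)                                   # one gradient step

# Toy usage of the serializer and perturbation (model.loss / model.update are
# assumed to be provided by the downstream LLM-based predictor):
sample = {
    "image_caption": "a crowded street at night",
    "tabular": {"city": "Taipei", "temp_c": 27},
    "audio_transcript": "sound of traffic and rain",
}
print(perturb_prompt(serialize_modalities(sample)))
```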
Related papers
- Enhance the Robustness of Text-Centric Multimodal Alignments [4.985886792128721]
This study assesses the quality and robustness of multimodal representations in the presence of missing entries, noise, or absent modalities.
We propose a new text-centric approach that achieves superior robustness compared to previous methods.
arXiv Detail & Related papers (2024-07-06T10:12:29Z)
- Text-centric Alignment for Multi-Modality Learning [3.6961400222746748]
We propose the Text-centric Alignment for Multi-Modality Learning (TAMML) approach.
By leveraging the unique properties of text as a unified semantic space, TAMML demonstrates significant improvements in handling unseen, diverse, and unpredictable modality combinations.
This study contributes to the field by offering a flexible, effective solution for real-world applications where modality availability is dynamic and uncertain.
arXiv Detail & Related papers (2024-02-12T22:07:43Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning [9.949354222717773]
Cross-modal attribute insertions are a realistic perturbation strategy for vision-and-language data.
We find that augmenting input text using cross-modal insertions causes state-of-the-art approaches for text-to-image retrieval and cross-modal entailment to perform poorly.
Crowd-sourced annotations demonstrate that cross-modal insertions lead to higher quality augmentations for multimodal data.
arXiv Detail & Related papers (2023-06-19T17:00:03Z)
- Iterative Adversarial Attack on Image-guided Story Ending Generation [37.42908817585858]
Multimodal learning involves developing models that can integrate information from various sources like images and texts.
Deep neural networks, which are the backbone of recent image-guided story ending generation (IgSEG) models, are vulnerable to adversarial samples.
We propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks.
arXiv Detail & Related papers (2023-05-16T06:19:03Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
arXiv Detail & Related papers (2020-11-03T11:18:04Z)
- Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis [103.69656907534456]
Recent multimodal learning methods with strong performance on human-centric tasks are often black boxes.
We propose Multimodal Routing, which adjusts weights between input modalities and output representations differently for each input sample.
arXiv Detail & Related papers (2020-04-29T13:42:22Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.