Enhance the Robustness of Text-Centric Multimodal Alignments
- URL: http://arxiv.org/abs/2407.05036v1
- Date: Sat, 6 Jul 2024 10:12:29 GMT
- Title: Enhance the Robustness of Text-Centric Multimodal Alignments
- Authors: Ting-Yu Yen, Yun-Da Tsai, Keng-Te Liao, Shou-De Lin
- Abstract summary: This study assesses the quality and robustness of multimodal representations in the presence of missing entries, noise, or absent modalities.
We propose a new text-centric approach that achieves superior robustness compared to previous methods.
- Score: 4.985886792128721
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Converting different modalities into general text, serving as input prompts for large language models (LLMs), is a common method to align multimodal models when there is limited pairwise data. This text-centric approach leverages the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation. This enables downstream models to effectively interpret various modal inputs. This study assesses the quality and robustness of multimodal representations in the presence of missing entries, noise, or absent modalities, revealing that current text-centric alignment methods compromise downstream robustness. To address this issue, we propose a new text-centric approach that achieves superior robustness compared to previous methods across various modalities in different settings. Our findings highlight the potential of this approach to enhance the robustness and adaptability of multimodal representations, offering a promising solution for dynamic and real-world applications.
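To make the text-centric setup concrete, here is a minimal illustrative sketch (not the authors' implementation): each non-text modality is verbalized into a short description and the pieces are concatenated into one prompt for an LLM. The helper functions, prompt format, and example values are hypothetical placeholders.
```python
# Minimal sketch of text-centric multimodal alignment (illustrative only).
# Each modality is verbalized into text, then the pieces are joined into one
# prompt for a downstream LLM. Helper functions are hypothetical placeholders.
from __future__ import annotations


def caption_image(image_path: str) -> str:
    # Stand-in for an image captioning model.
    return f"a photo ({image_path}) of a dog catching a frisbee in a park"


def describe_table(rows: list[dict]) -> str:
    # Serialize tabular records as plain text.
    return "; ".join(", ".join(f"{k}={v}" for k, v in row.items()) for row in rows)


def build_prompt(task: str, image_path: str | None, rows: list[dict] | None) -> str:
    parts = [f"Task: {task}"]
    if image_path is not None:   # a missing modality simply drops out of the prompt,
        parts.append(f"Image: {caption_image(image_path)}")
    if rows:                     # which is where robustness issues can arise
        parts.append(f"Table: {describe_table(rows)}")
    return "\n".join(parts)


print(build_prompt("classify the scene", "dog.jpg",
                   [{"location": "park", "weather": "sunny"}]))
```
The downstream model only ever sees this unified textual representation, which is why noisy or missing entries in any source directly degrade the prompt itself.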
Related papers
- Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting [4.985886792128721]
This study evaluates the quality and robustness of multimodal representations in the face of noise imperfections, dynamic input order permutations, and missing modalities.
We propose a new text-centric adversarial training approach that significantly enhances robustness compared to traditional robust training methods and pre-trained multimodal foundation models.
arXiv Detail & Related papers (2024-08-19T08:44:55Z)
- TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models [69.49978333446538]
TEAL is an approach that treats the input from any modality as a token sequence.
It embeds the token sequence into a joint embedding space with a learnable embedding matrix.
Experiments show that TEAL achieves substantial improvements in multi-modal understanding.
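The following is a rough PyTorch-style sketch of that tokenize-and-embed idea; the codebook size, discretization step, and dimensions are assumptions for illustration, not TEAL's actual tokenizers.
```python
# Rough sketch of a tokenize-and-embed scheme (assumed details, not TEAL's code).
# Inputs from any modality are discretized into token ids and mapped into a
# shared space by one learnable embedding matrix.
import torch
import torch.nn as nn

VOCAB, DIM = 4096, 512                       # placeholder codebook size / width
joint_embedding = nn.Embedding(VOCAB, DIM)   # the learnable embedding matrix


def discretize(features: torch.Tensor) -> torch.Tensor:
    # Placeholder "tokenizer": bucket continuous features into VOCAB bins.
    return torch.clamp((features * VOCAB).long(), 0, VOCAB - 1)


text_ids = torch.randint(0, VOCAB, (1, 16))         # pretend text tokens
image_ids = discretize(torch.rand(1, 64))           # pretend image patch tokens
sequence = torch.cat([text_ids, image_ids], dim=1)  # one interleaved token sequence
print(joint_embedding(sequence).shape)              # torch.Size([1, 80, 512])
```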
arXiv Detail & Related papers (2023-11-08T10:34:16Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
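As a loose illustration of single-stream early fusion (placeholder feature sizes and layers, not UmURL's architecture): per-modality features are projected, tagged with a modality embedding, and concatenated into one sequence for a shared encoder.
```python
# Loose single-stream early-fusion sketch (assumed shapes, not UmURL itself).
import torch
import torch.nn as nn

DIM = 128
proj_joint = nn.Linear(6, DIM)        # skeleton joint features -> DIM (placeholder)
proj_motion = nn.Linear(3, DIM)       # motion features         -> DIM (placeholder)
modality_tag = nn.Embedding(2, DIM)   # tells the encoder which modality a token is
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2)

joints, motion = torch.randn(1, 25, 6), torch.randn(1, 25, 3)
tokens = torch.cat([proj_joint(joints) + modality_tag.weight[0],
                    proj_motion(motion) + modality_tag.weight[1]], dim=1)  # (1, 50, DIM)
print(encoder(tokens).shape)   # a single stream encodes both modalities jointly
```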
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
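One standard way to read "strictly matching expected performance with human texts" (a generic maximum-entropy sketch, not necessarily the paper's exact construction) is to keep the decoding distribution $q$ close to the base LM $p$ while matching the human-level expectations of the metric functions $f_k$:
\[
\min_{q}\ \mathrm{KL}(q\,\|\,p)\quad \text{s.t.}\quad \mathbb{E}_{x\sim q}[f_k(x)] = \mathbb{E}_{x\sim\mathrm{human}}[f_k(x)],\ k=1,\dots,K
\quad\Longrightarrow\quad
q(x)\ \propto\ p(x)\,\exp\!\Big(\sum_{k=1}^{K}\mu_k f_k(x)\Big),
\]
where the multipliers $\mu_k$ are fit so the constraints hold. Because the human distribution itself satisfies the constraints, this projection brings $q$ at least as close to human text as $p$ is, which is consistent with the perplexity-improvement claim above.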
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning [9.949354222717773]
Cross-modal attribute insertions are a realistic perturbation strategy for vision-and-language data.
We find that augmenting input text using cross-modal insertions causes state-of-the-art approaches for text-to-image retrieval and cross-modal entailment to perform poorly.
Crowd-sourced annotations demonstrate that cross-modal insertions lead to higher quality augmentations for multimodal data.
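A toy sketch of such an insertion (assumed interface; the actual method selects attributes using multimodal signals): visual attributes detected in the paired image are inserted before the objects they modify in the caption, producing a meaning-preserving perturbation.
```python
# Toy sketch of a cross-modal attribute insertion (assumed interface).
# image_attributes maps an object mentioned in the caption to a visual
# attribute of that object detected in the paired image.

def insert_attributes(caption: str, image_attributes: dict[str, str]) -> str:
    words = []
    for word in caption.split():
        key = word.lower().strip(".,")
        if key in image_attributes:
            words.append(f"{image_attributes[key]} {word}")  # prepend the attribute
        else:
            words.append(word)
    return " ".join(words)


print(insert_attributes("a dog catches a frisbee in the park",
                        {"dog": "brown", "frisbee": "red"}))
# -> "a brown dog catches a red frisbee in the park"
```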
arXiv Detail & Related papers (2023-06-19T17:00:03Z)
- Iterative Adversarial Attack on Image-guided Story Ending Generation [37.42908817585858]
Multimodal learning involves developing models that can integrate information from various sources like images and texts.
Deep neural networks, the backbone of recent image-guided story ending generation (IgSEG) models, are vulnerable to adversarial samples.
We propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks.
arXiv Detail & Related papers (2023-05-16T06:19:03Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
Medical vision-and-language pre-training models come in two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
arXiv Detail & Related papers (2020-11-03T11:18:04Z)
- Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis [103.69656907534456]
Recent multimodal learning methods with strong performance on human-centric tasks are often black boxes.
We propose Multimodal Routing, which adjusts weights between input modalities and output representations differently for each input sample.
arXiv Detail & Related papers (2020-04-29T13:42:22Z)
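To illustrate the routing idea from the last entry (a schematic sketch with assumed shapes, not the paper's model): each sample gets its own softmax weights over modalities, and those weights both fuse the representations and serve as a local interpretability signal.
```python
# Schematic sketch of per-sample modality routing (assumed shapes/details).
import torch
import torch.nn as nn

DIM = 64
router = nn.Linear(3 * DIM, 3)   # one routing score per modality


def route(text: torch.Tensor, audio: torch.Tensor, video: torch.Tensor):
    stacked = torch.stack([text, audio, video], dim=1)           # (B, 3, DIM)
    weights = torch.softmax(router(stacked.flatten(1)), dim=-1)  # (B, 3), per sample
    fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)         # (B, DIM)
    return fused, weights   # weights double as a local interpretability signal


fused, weights = route(torch.randn(2, DIM), torch.randn(2, DIM), torch.randn(2, DIM))
print(weights)  # per-sample modality weights; different samples get different mixes
```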
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.