Benchmarking Robustness of Multimodal Image-Text Models under
Distribution Shift
- URL: http://arxiv.org/abs/2212.08044v3
- Date: Fri, 19 Jan 2024 15:29:34 GMT
- Title: Benchmarking Robustness of Multimodal Image-Text Models under
Distribution Shift
- Authors: Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding
Zhao, Bo Li, Mu Li
- Abstract summary: We investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks.
Character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data.
- Score: 50.64474103506595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal image-text models have shown remarkable performance in the past
few years. However, evaluating robustness against distribution shifts is
crucial before adopting them in real-world applications. In this work, we
investigate the robustness of 12 popular open-sourced image-text models under
common perturbations on five tasks (image-text retrieval, visual reasoning,
visual entailment, image captioning, and text-to-image generation). In
particular, we propose several new multimodal robustness benchmarks by applying
17 image perturbation and 16 text perturbation techniques on top of existing
datasets. We observe that multimodal models are not robust to image and text
perturbations, especially to image perturbations. Among the tested perturbation
methods, character-level perturbations constitute the most severe distribution
shift for text, and zoom blur is the most severe shift for image data. We also
introduce two new robustness metrics (MMI for MultiModal Impact score
and MOR for Missing Object Rate) for proper evaluations of multimodal
models. We hope our extensive study sheds light on new directions for the
development of robust multimodal models. More details can be found on the
project webpage: https://MMRobustness.github.io.
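As a rough illustration of what a character-level text perturbation can look like, the sketch below swaps one pair of adjacent inner characters in a fraction of the words of a caption. This is not the paper's implementation: the function name `swap_adjacent_chars`, the `rate` and `seed` parameters, and the choice of a swap (rather than an insertion, deletion, or keyboard typo) are assumptions made for this example only.

```python
import random


def swap_adjacent_chars(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Swap one pair of adjacent inner characters in a fraction of the words.

    Hypothetical helper for illustration; not taken from the benchmark's code.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    perturbed = []
    for word in text.split(" "):
        # Only touch words with at least two inner characters to swap.
        if len(word) >= 4 and rng.random() < rate:
            # Pick an inner position so the first and last characters stay fixed.
            i = rng.randrange(1, len(word) - 2)
            chars = list(word)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            word = "".join(chars)
        perturbed.append(word)
    return " ".join(perturbed)


if __name__ == "__main__":
    caption = "a man riding a wave on top of a surfboard"
    print(swap_adjacent_chars(caption, rate=1.0, seed=0))
```

A perturbed caption like this could then be fed to an image-text model in place of the clean caption to compare task scores, which is the general idea behind evaluating under text distribution shift.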
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation [40.42326040668964]
We introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence.
We build human feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence.
Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT.
arXiv Detail & Related papers (2024-12-17T07:41:23Z)
- Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion [3.399289369740637]
This paper presents a pioneering study on post-training pruning of Stable Diffusion 2.
It addresses the critical need for model compression in the text-to-image domain.
We propose an optimal pruning configuration that prunes the text encoder to 47.5% and the diffusion generator to 35%.
arXiv Detail & Related papers (2024-11-22T18:29:37Z)
- Debiasing Vison-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt.
Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z)
- Multimodal Foundation Models Exploit Text to Make Medical Image Predictions [3.4230952713864373]
We evaluate the mechanisms by which multimodal foundation models integrate and prioritize different data modalities, including images and text.
Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by their exploitation of text.
arXiv Detail & Related papers (2023-11-09T18:48:02Z)
- Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Iterative Adversarial Attack on Image-guided Story Ending Generation [37.42908817585858]
Multimodal learning involves developing models that can integrate information from various sources like images and texts.
Deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples.
We propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks.
arXiv Detail & Related papers (2023-05-16T06:19:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.