Cross-Modal Attribute Insertions for Assessing the Robustness of
  Vision-and-Language Learning
- URL: http://arxiv.org/abs/2306.11065v1
- Date: Mon, 19 Jun 2023 17:00:03 GMT
- Title: Cross-Modal Attribute Insertions for Assessing the Robustness of
  Vision-and-Language Learning
- Authors: Shivaen Ramshetty, Gaurav Verma, Srijan Kumar
- Abstract summary: Cross-modal attribute insertions are a realistic perturbation strategy for vision-and-language data.
We find that augmenting input text using cross-modal insertions causes state-of-the-art approaches for text-to-image retrieval and cross-modal entailment to perform poorly.
Crowd-sourced annotations demonstrate that cross-modal insertions lead to higher quality augmentations for multimodal data.
- Score: 9.949354222717773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   The robustness of multimodal deep learning models to realistic changes in the
input text is critical for their applicability to important tasks such as
text-to-image retrieval and cross-modal entailment. To measure robustness,
several existing approaches edit the text data, but do so without leveraging
the cross-modal information present in multimodal data. Information from the
visual modality, such as color, size, and shape, provides additional attributes
that users can include in their inputs. Thus, we propose cross-modal attribute
insertions as a realistic perturbation strategy for vision-and-language data
that inserts visual attributes of the objects in the image into the
corresponding text (e.g., "girl on a chair" to "little girl on a wooden
chair"). Our proposed approach for cross-modal attribute insertions is modular,
controllable, and task-agnostic. We find that augmenting input text using
cross-modal insertions causes state-of-the-art approaches for text-to-image
retrieval and cross-modal entailment to perform poorly, resulting in relative
drops of 15% in MRR and 20% in $F_1$ score, respectively. Crowd-sourced
annotations demonstrate that cross-modal insertions lead to higher quality
augmentations for multimodal data than augmentations using text-only data, and
are equivalent in quality to original examples. We release the code to
encourage robustness evaluations of deep vision-and-language models:
https://github.com/claws-lab/multimodal-robustness-xmai.
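The insertion idea and the relative-drop figures in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that): the hand-written `attrs` dictionary stands in for attributes that would be derived from the image, and the function names, the `max_insertions` budget, and all numbers are assumptions made up for illustration.

```python
import re


def insert_attributes(caption, attributes, max_insertions=2):
    """Insert one attribute word before each matching object word.

    `attributes` maps an object word to a visual attribute. Here it is a
    hypothetical hand-written dictionary, not the output of a real detector.
    """
    out = []
    used = 0
    for token in caption.split():
        key = re.sub(r"\W", "", token).lower()  # strip punctuation for lookup
        if used < max_insertions and key in attributes:
            out.append(attributes[key])
            used += 1
        out.append(token)
    return " ".join(out)


def mean_reciprocal_rank(ranks):
    """MRR over 1-based ranks of the correct item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def relative_drop(before, after):
    """Relative decrease of a metric, e.g. 0.5 -> 0.4 is a drop of about 0.2."""
    return (before - after) / before


# Toy attributes, assumed to come from the image (illustrative only).
attrs = {"girl": "little", "chair": "wooden"}
augmented = insert_attributes("girl on a chair", attrs)
print(augmented)

# Made-up ranks before and after perturbation, not results from the paper.
mrr_before = mean_reciprocal_rank([1, 2, 4])
mrr_after = mean_reciprocal_rank([2, 2, 4])
print(relative_drop(mrr_before, mrr_after))
```

The sketch reproduces the abstract's example ("girl on a chair" becomes "little girl on a wooden chair") but does not adjust articles such as "a" versus "an", and it matches single words only.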
 
      
Related papers
        - Vision-Language Models Struggle to Align Entities across Modalities [13.100184125419695]
 Cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation.
Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations.
We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans.
 (2025-03-05T19:36:43Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
 We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
 (2025-02-18T12:00:47Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
 Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
 ARMADA is a novel multimodal data generation framework that extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
 (2024-08-19T15:27:25Z)
- Enhance the Robustness of Text-Centric Multimodal Alignments [4.985886792128721]
 This study assesses the quality and robustness of multimodal representations in the presence of missing entries, noise, or absent modalities.
We propose a new text-centric approach that achieves superior robustness compared to previous methods.
 (2024-07-06T10:12:29Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
 We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
 (2024-02-27T06:11:54Z)
- Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image
  Person Retrieval [29.884153827619915]
 We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
 (2023-03-22T12:11:59Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft
  Prompts [63.84720380390935]
 There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
 (2023-02-17T15:43:42Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified
  Retrieval and Captioning [66.38951790650887]
 Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
 (2022-10-26T21:01:19Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text
  Pre-training [40.05046655477684]
 ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
 ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
 (2022-09-30T07:20:07Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained
  Multimodal Models for Robust Visual Question Answering [63.87200781247364]
 Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
 (2022-09-14T22:04:10Z)
- ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition [38.08486689940946]
 Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention.
It is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality.
In this paper, we propose Image-text Alignments (ITA) to align image features into the textual space.
 (2021-12-13T08:29:43Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on
  Multi-Source Datasets [56.018551958004814]
 This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
 (2021-11-24T19:00:05Z)
- Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in
  Multimodal Transformers [15.826109118064716]
 Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities.
We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information.
 (2021-09-09T17:47:50Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through
  Input Space Translation [75.82110684355979]
 We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
 (2021-08-03T18:02:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     