Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in
Multimodal Transformers
- URL: http://arxiv.org/abs/2109.04448v1
- Date: Thu, 9 Sep 2021 17:47:50 GMT
- Title: Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in
Multimodal Transformers
- Authors: Stella Frank, Emanuele Bugliarello, Desmond Elliott
- Abstract summary: Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities.
We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information.
- Score: 15.826109118064716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained vision-and-language BERTs aim to learn representations that
combine information from both modalities. We propose a diagnostic method based
on cross-modal input ablation to assess the extent to which these models
actually integrate cross-modal information. This method involves ablating
inputs from one modality, either entirely or selectively based on cross-modal
grounding alignments, and evaluating the model prediction performance on the
other modality. Model performance is measured by modality-specific tasks that
mirror the model pretraining objectives (e.g. masked language modelling for
text). Models that have learned to construct cross-modal representations using
both modalities are expected to perform worse when inputs are missing from a
modality. We find that recently proposed models have much greater relative
difficulty predicting text when visual information is ablated, compared to
predicting visual object categories when text is ablated, indicating that these
models are not symmetrically cross-modal.
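A minimal sketch of the cross-modal input ablation probe described in the abstract is shown below, assuming a VL-BERT-style model behind a hypothetical `mlm_score` wrapper; this is an illustrative sketch, not the authors' released code.

```python
# Minimal sketch of cross-modal input ablation. The `mlm_score` callable is a
# hypothetical interface around a VL-BERT-style model, not the authors' code.
from typing import Callable, List, Sequence

import torch


def ablate_regions(region_feats: torch.Tensor,
                   drop_idx: Sequence[int]) -> torch.Tensor:
    """Remove the ablated object regions by zeroing their visual features."""
    out = region_feats.clone()
    out[list(drop_idx)] = 0.0  # a learned MASK feature would also work here
    return out


def cross_modal_ablation_gap(
    mlm_score: Callable[[List[int], torch.Tensor, int], float],  # hypothetical wrapper
    text_ids: List[int],                 # token ids with one position masked
    region_feats: torch.Tensor,          # (num_regions, feat_dim) object features
    masked_pos: int,                     # index of the masked word to predict
    aligned_regions: Sequence[int],      # regions grounding the masked word
) -> float:
    """Drop in masked-word prediction score when grounded regions are ablated.

    `mlm_score` is assumed to run masked language modelling and return the
    model's score (e.g. log-probability) for the correct word at `masked_pos`.
    """
    score_full = mlm_score(text_ids, region_feats, masked_pos)
    score_ablated = mlm_score(text_ids,
                              ablate_regions(region_feats, aligned_regions),
                              masked_pos)
    # A model that truly integrates vision should score markedly worse without
    # the grounded regions; a near-zero gap suggests the prediction ignores them.
    return score_full - score_ablated
```

The same probe runs in the reverse direction, masking an object region and ablating the aligned words to test visual object-category prediction; the asymmetry between the two directions is the paper's main finding.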
Related papers
- Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset [0.39462888523270856]
We propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes.
Our work aims to delve deeper into the ability of models to understand indirect communication and seeks to contribute to the development of models capable of more refined and human-like interactions.
arXiv Detail & Related papers (2024-11-21T14:01:42Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps visual features to probability distributions over a Large Multi-modal Model's vocabulary (a minimal sketch of one such mapping appears after this list).
We further explore the distribution of visual features in the semantic space within LMMs and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution [49.762034744605955]
We propose a multi-modal information bottleneck approach to improve interpretability of vision-language models.
We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models.
arXiv Detail & Related papers (2023-12-28T18:02:22Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning [9.949354222717773]
Cross-modal attribute insertions are a realistic perturbation strategy for vision-and-language data.
We find that augmenting input text using cross-modal insertions causes state-of-the-art approaches for text-to-image retrieval and cross-modal entailment to perform poorly.
Crowd-sourced annotations demonstrate that cross-modal insertions lead to higher quality augmentations for multimodal data.
arXiv Detail & Related papers (2023-06-19T17:00:03Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, which uses a lightweight, transformer-based fusion module to pair frozen visual representations with language concepts.
We show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
arXiv Detail & Related papers (2022-10-27T02:57:26Z)
- Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training [21.017471684853987]
We introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training.
Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space.
CLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2022-06-01T16:45:24Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
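The "visual words" entry above describes mapping visual features onto a Large Multi-modal Model's vocabulary; the sketch below illustrates one way such a mapping could look, assuming a learned linear projection into the text embedding space followed by a softmax against the vocabulary embeddings. VisualWordHead and the toy shapes are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch: turn visual features into distributions over an LMM
# vocabulary via projection into the text embedding space (an assumption).
import torch
import torch.nn as nn


class VisualWordHead(nn.Module):
    """Maps visual patch/region features to distributions over the vocabulary."""

    def __init__(self, visual_dim: int, text_embed: nn.Embedding):
        super().__init__()
        # Align visual features with the text embedding space, then score them
        # against the vocabulary embedding matrix.
        self.proj = nn.Linear(visual_dim, text_embed.embedding_dim)
        self.vocab_embed = text_embed.weight  # (vocab_size, embed_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (num_patches, visual_dim)
        aligned = self.proj(visual_feats)        # (num_patches, embed_dim)
        logits = aligned @ self.vocab_embed.t()  # (num_patches, vocab_size)
        return logits.softmax(dim=-1)            # distributions over vocabulary


# Usage with toy shapes: 196 patches, 768-d visual features, 32k-word vocabulary.
embed = nn.Embedding(32_000, 512)
head = VisualWordHead(visual_dim=768, text_embed=embed)
dists = head(torch.randn(196, 768))
assert torch.allclose(dists.sum(dim=-1), torch.ones(196), atol=1e-5)
```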
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.