Beyond Image-Text Matching: Verb Understanding in Multimodal
Transformers Using Guided Masking
- URL: http://arxiv.org/abs/2401.16575v1
- Date: Mon, 29 Jan 2024 21:22:23 GMT
- Title: Beyond Image-Text Matching: Verb Understanding in Multimodal
Transformers Using Guided Masking
- Authors: Ivana Beňová, Jana Košecká, Michal Gregor, Martin Tamajka, Marcel Veselý, Marián Šimko
- Abstract summary: This work introduces an alternative probing strategy called guided masking.
The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy.
We show that, under guided masking, ViLBERT, LXMERT, UNITER, and VisualBERT can predict the correct verb with high accuracy.
- Score: 0.4543820534430524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The dominant probing approaches rely on the zero-shot performance of
image-text matching tasks to gain a finer-grained understanding of the
representations learned by recent multimodal image-language transformer models.
The evaluation is carried out on carefully curated datasets focusing on
counting, relations, attributes, and others. This work introduces an
alternative probing strategy called guided masking. The proposed approach
ablates different modalities using masking and assesses the model's ability to
predict the masked word with high accuracy. We focus on studying multimodal
models that consider regions of interest (ROI) features obtained by object
detectors as input tokens. We probe the understanding of verbs using guided
masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models
can predict the correct verb with high accuracy. This contrasts with previous
conclusions drawn from image-text matching probing techniques that frequently
fail in situations requiring verb understanding. The code for all experiments
will be publicly available at https://github.com/ivana-13/guided_masking.
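To make the probing setup concrete, the following is a minimal sketch of guided masking under stated assumptions: it uses the HuggingFace `VisualBertForPreTraining` checkpoint `uclanlp/visualbert-vqa-coco-pre`, random tensors stand in for the 2048-dimensional ROI features that an object detector would normally supply, and the caption, target verb, and shapes are illustrative rather than the paper's exact pipeline.

```python
# Guided-masking sketch (assumptions: HuggingFace VisualBERT checkpoint
# "uclanlp/visualbert-vqa-coco-pre"; random tensors stand in for detector ROI features).
import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
model.eval()

caption = "A man is riding a horse on the beach."  # illustrative caption
target_verb = "riding"                             # the verb to probe

# Guided masking: mask only the verb, leaving the rest of the caption intact.
masked_caption = caption.replace(target_verb, tokenizer.mask_token)
text_inputs = tokenizer(masked_caption, return_tensors="pt")

# Visual modality: VisualBERT appends ROI features after the text tokens.
# Real probing would use detector region features; random values are placeholders.
num_rois, feat_dim = 36, 2048
visual_embeds = torch.randn(1, num_rois, feat_dim)
visual_attention_mask = torch.ones(1, num_rois, dtype=torch.long)
visual_token_type_ids = torch.ones(1, num_rois, dtype=torch.long)

with torch.no_grad():
    outputs = model(
        **text_inputs,
        visual_embeds=visual_embeds,
        visual_attention_mask=visual_attention_mask,
        visual_token_type_ids=visual_token_type_ids,
    )

# Text tokens precede visual tokens in VisualBERT's sequence, so the [MASK]
# position found in input_ids indexes prediction_logits directly.
mask_pos = (text_inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
verb_logits = outputs.prediction_logits[0, mask_pos]  # (1, vocab_size)
top5_ids = verb_logits.topk(5, dim=-1).indices[0].tolist()
print("Top-5 fillers for the masked verb:", tokenizer.convert_ids_to_tokens(top5_ids))
```

Accuracy in this setting is the fraction of examples where the ground-truth verb appears among the model's top-k fillers for the masked position; ablating the visual input (for example, dropping or zeroing the ROI features) then indicates how much the image contributes to that prediction.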
Related papers
- Exploring Simple Open-Vocabulary Semantic Segmentation [7.245983878396646]
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts.
In this paper, we introduce S-Seg, a novel model that achieves surprisingly strong performance without depending on any of the components such models typically require.
arXiv Detail & Related papers (2024-01-22T18:59:29Z) - Towards Better Multi-modal Keyphrase Generation via Visual Entity
Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z) - Multi-Modal Mutual Attention and Iterative Interaction for Referring
Image Segmentation [49.6153714376745]
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression.
We propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities.
arXiv Detail & Related papers (2023-05-24T16:26:05Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training)
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z) - Align before Fuse: Vision and Language Representation Learning with
Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z) - RpBERT: A Text-image Relation Propagation-based BERT Model for
Multimodal NER [4.510210055307459]
Multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets.
We introduce a method of text-image relation propagation into the multimodal BERT model.
We propose a multitask algorithm to train on the MNER datasets.
arXiv Detail & Related papers (2021-02-05T02:45:30Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)