Seeing past words: Testing the cross-modal capabilities of pretrained
V&L models
- URL: http://arxiv.org/abs/2012.12352v1
- Date: Tue, 22 Dec 2020 21:01:44 GMT
- Title: Seeing past words: Testing the cross-modal capabilities of pretrained
V&L models
- Authors: Letitia Parcalabescu and Albert Gatt and Anette Frank and Iacer
Calixto
- Abstract summary: We investigate the ability of general-purpose pretrained vision and language (V&L) models to perform reasoning in two tasks that require multimodal integration.
We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT.
Our investigations suggest that pretrained V&L representations are less successful than expected at integrating the two modalities.
- Score: 18.73444918172383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the ability of general-purpose pretrained vision and
language (V&L) models to perform reasoning in two tasks that require multimodal
integration: (1) discriminating a correct image-sentence pair from an incorrect
one, and (2) counting entities in an image. We evaluate three pretrained V&L
models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and
finetuned settings. Our results show that models solve task (1) very well, as
expected, since all models use task (1) for pretraining. However, none of the
pretrained V&L models are able to adequately solve task (2), our counting
probe, and they cannot generalise to out-of-distribution quantities. Our
investigations suggest that pretrained V&L representations are less successful
than expected at integrating the two modalities. We propose a number of
explanations for these findings: LXMERT's results on the image-sentence
alignment task (and to a lesser extent those obtained by ViLBERT 12-in-1)
indicate that the model may exhibit catastrophic forgetting. As for our results
on the counting probe, we find evidence that all models are impacted by dataset
bias, and also fail to individuate entities in the visual input.
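Below is a minimal sketch of how a zero-shot image-sentence alignment probe of this kind could be run with the LXMERT implementation in HuggingFace Transformers. This is an illustration under stated assumptions, not the paper's actual evaluation code: the region features would normally come from a Faster R-CNN detector (as in the original LXMERT pipeline), but random tensors stand in for them here, and treating index 1 of the cross-modal matching head as the "matched" class is our assumption.

```python
# Sketch: zero-shot image-sentence alignment probing with LXMERT (assumptions noted in comments).
import torch
from transformers import LxmertTokenizer, LxmertForPreTraining

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased")
model.eval()

def alignment_score(caption, visual_feats, visual_pos):
    """Return a matching probability from LXMERT's cross-modal matching head."""
    inputs = tokenizer(caption, return_tensors="pt")
    with torch.no_grad():
        out = model(
            **inputs,
            visual_feats=visual_feats,  # (1, num_regions, 2048) detector region features
            visual_pos=visual_pos,      # (1, num_regions, 4) normalized box coordinates
        )
    # cross_relationship_score holds the two matching logits; assuming index 1 = "matched".
    return torch.softmax(out.cross_relationship_score, dim=-1)[0, 1].item()

# Dummy tensors stand in for real Faster R-CNN features in this illustration.
feats = torch.randn(1, 36, 2048)
boxes = torch.rand(1, 36, 4)
print(alignment_score("a dog chasing a ball", feats, boxes))
print(alignment_score("three cats on a sofa", feats, boxes))
```

With real detector features, a probe along these lines would score each image against its correct caption and a foil caption and check whether the matched pair receives the higher score, which is the setup task (1) describes.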
Related papers
- Aligning Modalities in Vision Large Language Models via Preference
Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - LXMERT Model Compression for Visual Question Answering [0.03749861135832073]
We show that LXMERT can be effectively pruned by 40%-60% in size with only a 3% loss in accuracy.
arXiv Detail & Related papers (2023-10-23T19:46:41Z) - UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z) - Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z) - TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating
Visio-Linguistic Reasoning [25.520406167426135]
We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks.
Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information.
We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
arXiv Detail & Related papers (2021-11-21T07:22:44Z) - Playing Lottery Tickets with Vision and Language [62.6420670250559]
Large-scale transformer-based pre-training has revolutionized vision-and-language (V+L) research.
In parallel, work on the lottery ticket hypothesis has shown that deep neural networks contain small matching subnetworks that can achieve performance on par with or even better than the dense networks when trained in isolation.
We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments.
arXiv Detail & Related papers (2021-04-23T22:24:33Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images
and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z) - Assisting Scene Graph Generation with Self-Supervision [21.89909688056478]
We propose a set of three novel yet simple self-supervision tasks and train them as auxiliary multi-tasks to the main model.
When we train the base model from scratch with these self-supervision tasks, we achieve state-of-the-art results across all metrics and recall settings.
arXiv Detail & Related papers (2020-08-08T16:38:03Z)