How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
- URL: http://arxiv.org/abs/2209.08982v1
- Date: Mon, 19 Sep 2022 13:00:12 GMT
- Title: How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
- Authors: Lovisa Hagström, Richard Johansson
- Abstract summary: We focus on pre-trained multimodal vision-and-language (VL) models for which there already are some results on their language understanding capabilities.
An unresolved issue with evaluating the linguistic skills of these models is that there is no established method for adapting them to text-only input without out-of-distribution uncertainty.
Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be put into adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how we adapt them to non-zero-shot tasks.
- Score: 0.13706331473063876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current language models have been criticised for learning language from text
alone, without a connection between words and their meaning. Consequently,
multimodal training has been proposed as a way of creating models with better
language understanding by providing the missing connection. We focus on
pre-trained multimodal vision-and-language (VL) models for which there already
are some results on their language understanding capabilities. An unresolved
issue with evaluating the linguistic skills of these models, however, is that
there is no established method for adapting them to text-only input without
out-of-distribution uncertainty. To find the best approach, we investigate and
compare seven possible methods for adapting three different pre-trained VL
models to text-only input. Our evaluations on both GLUE and Visual Property
Norms (VPN) show that care should be put into adapting VL models to zero-shot
text-only tasks, while the models are less sensitive to how we adapt them to
non-zero-shot tasks. We also find that the adaptation methods perform
differently for different models and that unimodal model counterparts perform
on par with the VL models regardless of adaptation, indicating that current VL
models do not necessarily gain better language understanding from their
multimodal training.
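
The specific adaptation recipes compared in the paper are not listed here, so the sketch below is only a hedged illustration of two commonly discussed ways to run a pre-trained VL model on text-only input: using just the text tower of a dual-encoder model, and feeding placeholder (zero-valued) visual features to a single-stream model. The checkpoints, the 2048-dimensional region-feature size, and the choice of zeroed features are illustrative assumptions, not the authors' evaluated methods.

    # Hedged sketch: two illustrative ways to feed text-only input to
    # pre-trained VL models (not the paper's exact adaptation methods).
    import torch
    from transformers import (
        CLIPModel, CLIPTokenizer,          # dual-encoder VL model
        VisualBertModel, BertTokenizer,    # single-stream VL model
    )

    sentence = "A lemon is yellow and sour."

    # (1) Dual-encoder model: use only the text encoder and ignore the image tower.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    text_inputs = clip_tok(sentence, return_tensors="pt")
    with torch.no_grad():
        text_emb = clip.get_text_features(**text_inputs)  # (1, 512) text embedding

    # (2) Single-stream model: supply placeholder visual features. Zero-valued
    #     region features are used here; the 2048-d size matches the Faster R-CNN
    #     features this checkpoint expects (an assumption made for this sketch).
    vbert = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
    bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
    inputs = bert_tok(sentence, return_tensors="pt")
    visual_embeds = torch.zeros(1, 1, 2048)                  # one "blank" region
    visual_attention_mask = torch.ones(1, 1, dtype=torch.float)
    visual_token_type_ids = torch.ones(1, 1, dtype=torch.long)
    with torch.no_grad():
        out = vbert(
            **inputs,
            visual_embeds=visual_embeds,
            visual_attention_mask=visual_attention_mask,
            visual_token_type_ids=visual_token_type_ids,
        )
    text_repr = out.last_hidden_state[:, 0]  # [CLS] vector for downstream tasks

Whether the placeholder features should be zeros, learned vectors, or features from a blank image is exactly the kind of design choice the paper's comparison of adaptation methods addresses; the zeroed features above are just one option.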
Related papers
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z)
- Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning [11.339580074756189]
MAGMA is a simple method for augmenting generative language models with additional modalities using adapter-based finetuning.
We train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input.
arXiv Detail & Related papers (2021-12-09T23:58:45Z)
- Does Vision-and-Language Pretraining Improve Lexical Grounding? [25.357191933430627]
Vision-and-Language models are trained jointly on text and image or video data.
It is not yet known how the internal linguistic representations themselves compare to their text-only counterparts.
arXiv Detail & Related papers (2021-09-21T15:12:39Z)
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach performs close to a model pre-trained with aligned data, on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.