Does Vision-and-Language Pretraining Improve Lexical Grounding?
- URL: http://arxiv.org/abs/2109.10246v1
- Date: Tue, 21 Sep 2021 15:12:39 GMT
- Title: Does Vision-and-Language Pretraining Improve Lexical Grounding?
- Authors: Tian Yun, Chen Sun, Ellie Pavlick
- Abstract summary: Vision-and-Language models are trained jointly on text and image or video data.
It is not yet known how the internal linguistic representations themselves compare to their text-only counterparts.
- Score: 25.357191933430627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world.
Vision-and-Language (VL) models, trained jointly on text and image or video data, have been offered as a response to such criticisms.
However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text-only counterparts.
This paper compares the semantic representations learned via VL vs. text-only pretraining for two recent VL models using a suite of analyses (clustering, probing, and performance on a commonsense question answering task) in a language-only setting.
We find that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.
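For intuition about the probing portion of this analysis suite, below is a minimal sketch that fits a linear probe on frozen sentence embeddings from a text-only encoder and from a VL encoder and compares held-out accuracy. The embedding-extraction step, the probing task, and the variable names (text_only_embeddings, vl_embeddings, labels) are illustrative assumptions, not the paper's exact setup.

```python
# A minimal probing sketch, assuming sentence embeddings have already been
# extracted (in a language-only setting) from a text-only encoder and from a
# VL encoder; the probing task and variable names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def probe_accuracy(embeddings: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """Fit a linear probe on frozen embeddings and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, probe.predict(X_te))


# Hypothetical inputs: (n_sentences, hidden_dim) embeddings from each encoder,
# plus per-sentence labels for some lexical/semantic property.
# acc_text = probe_accuracy(text_only_embeddings, labels)
# acc_vl = probe_accuracy(vl_embeddings, labels)
# print(f"text-only probe: {acc_text:.3f} | VL probe: {acc_vl:.3f}")
```

Comparing the two accuracies under identical probes and splits is what lets a difference (or lack of one) be attributed to the pretraining regime rather than to the probe itself.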
Related papers
- A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives [13.581385765600265]
Pretrained language models (PLMs) display impressive performance and have captured the attention of the NLP community.
This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment.
arXiv Detail & Related papers (2024-07-22T09:16:30Z)
- Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge: integrating robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input? [0.13706331473063876]
We focus on pre-trained multimodal vision-and-language (VL) models for which there already are some results on their language understanding capabilities.
An unresolved issue with evaluating the linguistic skills of these models is that there is no established method for adapting them to text-only input without out-of-distribution uncertainty.
Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be taken when adapting VL models to zero-shot text-only tasks, whereas the models are less sensitive to how they are adapted to non-zero-shot tasks.
arXiv Detail & Related papers (2022-09-19T13:00:12Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning (a minimal sketch of such an objective follows this list).
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
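Several of the entries above (RC3 and InfoXLM in particular) rely on contrastive pre-training objectives. The following is a minimal, generic InfoNCE-style sketch of such an objective; the in-batch-negatives setup, the temperature value, and the pairing of anchors with positives are illustrative assumptions, not the exact formulation used by either paper.

```python
# A generic InfoNCE-style contrastive objective over paired sentence
# embeddings with in-batch negatives; an illustrative sketch, not the exact
# objective of RC3 or InfoXLM.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (batch, dim) embeddings; row i of each is a true pair."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # Similarity of every anchor with every candidate; the diagonal entries are
    # the true pairs, all other columns act as in-batch negatives.
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


# Example with random tensors standing in for two encoders' outputs:
# loss = info_nce_loss(torch.randn(32, 768), torch.randn(32, 768))
```

The actual objectives in these papers pair inputs across languages and modalities and may differ in how negatives are drawn and how the encoders are shared; the sketch only shows the common contrastive core.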