Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining
on Visual Language Understanding
- URL: http://arxiv.org/abs/2303.12513v2
- Date: Thu, 2 Nov 2023 10:38:49 GMT
- Title: Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining
on Visual Language Understanding
- Authors: Morris Alper, Michael Fiman, Hadar Averbuch-Elor
- Abstract summary: We investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning.
We propose a suite of visual language understanding tasks for probing the visual reasoning abilities of text encoder models.
We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks.
- Score: 13.300199242824934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most humans use visual imagination to understand and reason about language,
but models such as BERT reason about language using knowledge acquired during
text-only pretraining. In this work, we investigate whether vision-and-language
pretraining can improve performance on text-only tasks that involve implicit
visual reasoning, focusing primarily on zero-shot probing methods. We propose a
suite of visual language understanding (VLU) tasks for probing the visual
reasoning abilities of text encoder models, as well as various non-visual
natural language understanding (NLU) tasks for comparison. We also contribute a
novel zero-shot knowledge probing method, Stroop probing, for applying models
such as CLIP to text-only tasks without needing a prediction head such as the
masked language modelling head of models like BERT. We show that SOTA
multimodally trained text encoders outperform unimodally trained text encoders
on the VLU tasks while underperforming them on the NLU tasks, lending
new context to previously mixed results regarding the NLU capabilities of
multimodal models. We conclude that exposure to images during pretraining
affords inherent visual reasoning knowledge that is reflected in language-only
tasks that require implicit visual reasoning. Our findings bear importance in
the broader context of multimodal learning, providing principled guidelines for
the choice of text encoders used in such contexts.
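As a concrete illustration of the zero-shot setting described above, the sketch below scores candidate completions of a text prompt using only CLIP's text encoder, with no prediction head. This is a minimal, hypothetical example assuming the Hugging Face transformers CLIP classes and the openai/clip-vit-base-patch32 checkpoint; the cosine-similarity scoring rule is a simplification for illustration, not the paper's exact Stroop probing procedure. A BERT fill-mask probe is included at the end as the prediction-head-based contrast.

```python
# Minimal sketch: zero-shot, text-only probing with CLIP's text encoder.
# Assumptions: Hugging Face `transformers`, the openai/clip-vit-base-patch32
# checkpoint, and a simple cosine-similarity scoring rule (an illustrative
# stand-in, not the paper's exact Stroop probing procedure).
import torch
from transformers import CLIPModel, CLIPTokenizer, pipeline

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def score_candidates(context, candidates):
    """Score each candidate by cosine similarity between the text embedding
    of the bare context and of the context with the candidate appended."""
    texts = [context] + [f"{context} {c}" for c in candidates]
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = torch.nn.functional.normalize(emb, dim=-1)
    sims = (emb[1:] @ emb[0]).tolist()  # similarity of each completion to the context
    return dict(zip(candidates, sims))

# Example VLU-style query (hypothetical): color association for "banana".
print(score_candidates("the color of a ripe banana is", ["yellow", "blue", "red"]))

# For contrast, a unimodally trained encoder probed through its masked-LM
# head, i.e. the kind of prediction head the embedding-only probe avoids.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("the color of a ripe banana is [MASK].", targets=["yellow", "blue", "red"]):
    print(pred["token_str"], round(pred["score"], 4))
```

The point of the contrast is that the embedding-based probe needs only a scoring rule over text embeddings, which is what allows contrastively trained encoders such as CLIP to be compared against masked-LM models on the same multiple-choice VLU queries.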
Related papers
- Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages [3.3227703089509304]
We propose a simple yet efficient approach to adapt Vision-Language Pre-training to unseen languages using a multilingual pre-trained language model (MPLM).
Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data.
arXiv Detail & Related papers (2023-06-29T08:20:57Z)
- Is Multimodal Vision Supervision Beneficial to Language? [2.216702991322677]
Vision (image and video) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multimodal tasks.
We compare the language representations of the stand-alone text encoders of these models with those of text encoders learnt through vision supervision.
arXiv Detail & Related papers (2023-02-10T02:22:44Z)
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [32.49636188029509]
We produce models using only text training data on four representative tasks.
We find these models perform close to models trained on images.
We showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data.
arXiv Detail & Related papers (2022-11-17T18:52:19Z)
- CLIP also Understands Text: Prompting CLIP for Phrase Understanding [65.59857372525664]
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision.
In this paper, we find that the text encoder of CLIP actually demonstrates strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt.
arXiv Detail & Related papers (2022-10-11T23:35:18Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose Unified multimodal pre-training for both Vision-Language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.