Is Multimodal Vision Supervision Beneficial to Language?
- URL: http://arxiv.org/abs/2302.05016v2
- Date: Sat, 15 Apr 2023 00:04:54 GMT
- Title: Is Multimodal Vision Supervision Beneficial to Language?
- Authors: Avinash Madasu, Vasudev Lal
- Abstract summary: Vision (image and video) - Language (VL) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks.
We compare the language representations of the stand-alone text encoders of these models to those of text encoders learnt through vision supervision.
- Score: 2.216702991322677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision (image and video) - Language (VL) pre-training is a recently
popular paradigm that has achieved state-of-the-art results on multi-modal
tasks such as image retrieval, video retrieval, and visual question answering.
These models are trained in an unsupervised way and benefit greatly from the
complementary modality supervision. In this paper, we explore whether language
representations trained using vision supervision perform better than vanilla
language representations on Natural Language Understanding and commonsense
reasoning benchmarks. We experiment with a diverse set of image-text models
(ALBEF, BLIP, METER) and video-text models (ALPRO, Frozen-in-Time (FiT), and
VIOLET). We compare the language representations of the stand-alone text
encoders of these models to those of text encoders learnt through vision
supervision. Our experiments suggest that vanilla language representations show
superior performance on most tasks. These results shed light on the current
drawbacks of vision-language models.
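The comparison described above can be illustrated with a small linear-probing sketch: embed the same sentences with a stand-alone text encoder and with a text encoder taken from a vision-language model, then train an identical linear classifier on each set of features. This is not the authors' code; the checkpoint names and the probing setup below are illustrative assumptions only.

```python
# Minimal sketch (not the authors' code): probe sentence representations from a
# stand-alone text encoder vs. a vision-supervised text encoder with a linear probe.
# The checkpoint names below are placeholders, not the exact models used in the paper.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def embed(texts, model_name):
    """Return mean-pooled last-hidden-state embeddings for a list of sentences."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    with torch.no_grad():
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state        # (B, T, D)
        mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)    # mean over real tokens
    return pooled.numpy()

def linear_probe(train_x, train_y, test_x, test_y):
    """Fit a linear classifier on frozen features and report test accuracy."""
    clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
    return accuracy_score(test_y, clf.predict(test_x))

# Hypothetical usage: same sentences, two encoders, one frozen linear probe each.
# "bert-base-uncased" stands in for the vanilla baseline; the second name is a
# placeholder for a text encoder extracted from a VL model such as ALBEF or BLIP.
# acc_vanilla = linear_probe(embed(train_sents, "bert-base-uncased"), train_y,
#                            embed(test_sents, "bert-base-uncased"), test_y)
# acc_vl      = linear_probe(embed(train_sents, "vl-text-encoder-checkpoint"), train_y,
#                            embed(test_sents, "vl-text-encoder-checkpoint"), test_y)
```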
Related papers
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics comparable to words and support dynamic sequence lengths that vary with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
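As a rough illustration of the discrete visual tokenization idea in the entry above, the sketch below quantizes visual patch features against a learned codebook. It is not LaVIT's actual tokenizer; the shapes and the codebook are assumptions for the example.

```python
# Toy sketch of discrete visual tokenization (not LaVIT's tokenizer): map patch
# features to the nearest entry of a learned codebook, yielding a sequence of
# discrete token ids that a language model can consume like foreign-language text.
import torch

def quantize_patches(patch_feats, codebook):
    # patch_feats: (N, D) visual patch features; codebook: (K, D) learned code vectors
    dists = torch.cdist(patch_feats, codebook)   # (N, K) pairwise L2 distances
    return dists.argmin(dim=-1)                  # (N,) discrete visual token ids
```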
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z) - Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining
on Visual Language Understanding [13.300199242824934]
We investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning.
We propose a suite of visual language understanding tasks for probing the visual reasoning abilities of text encoder models.
We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks.
arXiv Detail & Related papers (2023-03-21T17:30:40Z) - Localization vs. Semantics: Visual Representations in Unimodal and
Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, enabling the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
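The ALBEF entry above centres on aligning image and text representations before fusion. The sketch below shows a generic symmetric image-text contrastive (InfoNCE-style) loss in that spirit; it omits ALBEF's momentum encoder and pseudo-target distillation and is not the paper's implementation.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss in the
# spirit of ALBEF's alignment objective; the momentum distillation used by the
# actual method is omitted here.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) projected embeddings of matched image-text pairs
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> matching image
    return (loss_i2t + loss_t2i) / 2
```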
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
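As a toy illustration of the vokenization idea above, the sketch below assigns each contextual token embedding to its nearest image in a pre-computed image bank and defines an auxiliary voken-prediction loss. This is a simplified, assumption-laden sketch, not the authors' released pipeline.

```python
# Toy sketch of vokenization (not the authors' released code): map each token's
# contextual embedding to its nearest image ("voken") in a small image bank, then
# supervise the language model with an auxiliary voken-prediction loss.
import torch
import torch.nn.functional as F

def assign_vokens(token_states, image_bank):
    # token_states: (B, T, D) contextual token embeddings from a text encoder
    # image_bank:   (V, D) embeddings of candidate images, assumed pre-computed
    tok = F.normalize(token_states, dim=-1)
    img = F.normalize(image_bank, dim=-1)
    sims = tok @ img.t()                  # (B, T, V) token-image similarities
    return sims.argmax(dim=-1)            # (B, T) voken id per token

def voken_loss(voken_logits, voken_ids):
    # voken_logits: (B, T, V) predictions from a voken-classification head
    return F.cross_entropy(voken_logits.flatten(0, 1), voken_ids.flatten())
```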