Effect of Vision-and-Language Extensions on Natural Language
Understanding in Vision-and-Language Models
- URL: http://arxiv.org/abs/2104.08066v1
- Date: Fri, 16 Apr 2021 12:28:50 GMT
- Title: Effect of Vision-and-Language Extensions on Natural Language
Understanding in Vision-and-Language Models
- Authors: Taichi Iki, Akiko Aizawa
- Abstract summary: This paper investigates how visual extension affects the language capability of V&L models using the GLUE benchmark.
We found that visual extension causes some decrease in language capability and that V&L pretraining has a greater impact on this decrease than structural modifications.
- Score: 24.5834345625595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structural modifications and vision-and-language (V&L) pretraining
are successful ways of extending language models into V&L models that can
ground vision and language. Potential applications of these advanced models
include multi-modal machine reading comprehension models and multi-modal
dialogue models, which require language ability on top of grounding. Although
language capability is crucial for such applications, the impact of extending
their visual capabilities on their language capabilities is not fully
understood. This paper investigates how visual extension affects the language
capability of V&L models using the GLUE benchmark. We found that visual
extension causes some decrease in language capability and that V&L pretraining
has a greater impact on this decrease than structural modifications. Our
results suggest the need for further study on pretraining that can maintain
or, if possible, improve a model's language capability.
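The evaluation setup described in the abstract can be approximated with standard tooling. Below is a minimal sketch, assuming the Hugging Face `transformers` and `datasets` libraries, of fine-tuning a text encoder on a GLUE task (SST-2) to probe language capability; the `bert-base-uncased` checkpoint is a placeholder, not one of the V&L models compared in the paper.

```python
# Minimal GLUE probe sketch (assumed setup, not the paper's exact code):
# fine-tune a text encoder on SST-2 and evaluate it on the validation split.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # placeholder; swap in a V&L model's text encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="glue_sst2_probe", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # default collator then pads each batch dynamically
)
trainer.train()
print(trainer.evaluate())  # eval loss; add a compute_metrics fn for accuracy
```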
Related papers
- LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language [2.9914612342004503]
This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages.
We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension.
The results show that continual training improves language comprehension, as reflected in perplexity scores, and that task-specific tuning generally enhances performance on downstream tasks.
arXiv Detail & Related papers (2024-05-13T13:41:59Z)
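As a concrete illustration of one strategy from this entry, the sketch below shows vocabulary extension with Hugging Face `transformers`; the base checkpoint and the added subword pieces are illustrative assumptions, not the paper's exact choices.

```python
# Hedged sketch of vocabulary extension for a low-resource target language.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # assumed English-centric base model (gated; any causal LM works)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

new_subwords = ["▁öğrenme", "▁yapay", "▁dil"]  # hypothetical frequent target-language pieces
tokenizer.add_tokens(new_subwords)
model.resize_token_embeddings(len(tokenizer))  # newly added embedding rows are freshly initialized
# Continual pretraining / instruction fine-tuning on target-language text would follow from here.
```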
- Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling [47.7950860342515]
LexiContrastive Grounding (LCG) is a grounded language learning procedure that leverages visual supervision to improve textual representations.
LCG outperforms standard language-only models in learning efficiency.
It improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.
arXiv Detail & Related papers (2024-03-21T16:52:01Z)
- Language Grounded QFormer for Efficient Vision Language Understanding [25.432918254523344]
We take inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen modalities.
We propose a more efficient method for QFormer-based vision-language alignment.
arXiv Detail & Related papers (2023-11-13T16:30:49Z)
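The QFormer pattern referenced in this entry can be sketched as a small set of learnable queries that cross-attend to frozen image features, producing a fixed-length visual prefix for a language model. The PyTorch toy module below is an illustrative assumption, not the paper's implementation.

```python
# Hedged sketch of QFormer-style bridging between a frozen vision encoder and an LM.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # maps attended queries into the LM embedding space

    def forward(self, image_feats):  # image_feats: (B, N_patches, dim), from a frozen encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)        # (B, num_queries, dim) visual prefix for the LM

image_feats = torch.randn(2, 196, 768)  # stand-in for frozen ViT patch features
prefix = TinyQFormer()(image_feats)
print(prefix.shape)                     # torch.Size([2, 32, 768])
```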
- VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS).
It combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
arXiv Detail & Related papers (2023-10-15T07:58:52Z)
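The VLIS entry above describes a training-free combination of a text-only language model and a vision-language model at decoding time. The sketch below shows one plausible reading: weight the text LM's next-token distribution by how much image conditioning shifts the VLM's distribution. The function and the weighting constant are illustrative assumptions.

```python
# Hedged sketch of importance-weighted decoding in the spirit of the VLIS entry.
import torch

def vlis_step(text_lm_logits, vlm_logits_with_image, vlm_logits_no_image, alpha=1.0):
    """Combine one decoding step; all inputs are (vocab,) logits."""
    log_p_text = torch.log_softmax(text_lm_logits, dim=-1)
    # How much conditioning on the image changes the VLM's belief about each token.
    image_influence = (torch.log_softmax(vlm_logits_with_image, dim=-1)
                       - torch.log_softmax(vlm_logits_no_image, dim=-1))
    return log_p_text + alpha * image_influence  # unnormalized combined score

scores = vlis_step(torch.randn(32000), torch.randn(32000), torch.randn(32000))
next_token = scores.argmax().item()
```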
- Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning [56.03057119008865]
We show that scaling diffusion language models can effectively make them strong language learners.
We build competent diffusion language models at scale by first acquiring knowledge from massive data.
Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks.
arXiv Detail & Related papers (2023-08-23T16:01:12Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z)
- Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks [17.97052348690598]
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms.
Multimodal models that can additionally handle images as input have yet to catch up with language-only models in size and generality.
We make visual information accessible to the language model using separate verbalisation models.
arXiv Detail & Related papers (2023-05-23T07:50:36Z)
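The "separate verbalisation models" mentioned in this entry amount to describing the image in text before handing it to a language-only model. The sketch below shows that pattern with Hugging Face pipelines; both checkpoints and the image path are stand-ins rather than the paper's setup.

```python
# Hedged sketch: verbalize an image with a captioner, then query a text-only LM.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="gpt2")  # placeholder for a large LLM

caption = captioner("photo.jpg")[0]["generated_text"]  # "photo.jpg" is a placeholder image
prompt = (f"Image description: {caption}\n"
          "Question: What is happening in the image?\nAnswer:")
print(llm(prompt, max_new_tokens=30)[0]["generated_text"])
```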
- A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
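The core mechanism described in this entry, injecting continuous sensor readings into a language model alongside word embeddings, can be sketched as below; the dimensions, the encoder, and the stand-in text embeddings are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch: encode continuous sensor readings into the LM's embedding space
# and interleave them with word embeddings before feeding the language model.
import torch
import torch.nn as nn

lm_dim = 1024
sensor_encoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, lm_dim))

sensor_readings = torch.randn(1, 8, 64)          # e.g., 8 time steps of a 64-d state vector
sensor_tokens = sensor_encoder(sensor_readings)  # (1, 8, lm_dim) "multimodal tokens"

word_embeddings = torch.randn(1, 12, lm_dim)     # stand-in for an embedded text prompt
inputs_embeds = torch.cat([sensor_tokens, word_embeddings], dim=1)
# inputs_embeds would be passed to a pre-trained LM that accepts embeddings directly,
# with the encoder trained end-to-end for embodied tasks as the entry describes.
print(inputs_embeds.shape)                       # torch.Size([1, 20, 1024])
```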
- PaLI: A Jointly-Scaled Multilingual Language-Image Model [110.10710554358455]
PaLI (Pathways Language and Image model) extends the scaling approach of large language models to the joint modeling of language and vision.
We create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages.
arXiv Detail & Related papers (2022-09-14T17:24:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.