Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination
- URL: http://arxiv.org/abs/2210.12261v1
- Date: Fri, 21 Oct 2022 21:33:10 GMT
- Title: Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination
- Authors: Yue Yang, Wenlin Yao, Hongming Zhang, Xiaoyang Wang, Dong Yu, Jianshu Chen
- Abstract summary: We develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities.
We leverage two complementary types of "imaginations": (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation.
Jointly exploiting the language inputs and the imagination, a pretrained vision-language model eventually composes a zero-shot solution to the original language tasks.
- Score: 57.49336064527538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pretrained language models have made significant advances in
solving downstream language understanding tasks. However, they generally suffer
from reporting bias, the phenomenon describing the lack of explicit commonsense
knowledge in written text, e.g., "an orange is orange". To overcome this
limitation, we develop a novel approach, Z-LaVI, to endow language models with
visual imagination capabilities. Specifically, we leverage two complementary
types of "imaginations": (i) recalling existing images through retrieval and
(ii) synthesizing nonexistent images via text-to-image generation. Jointly
exploiting the language inputs and the imagination, a pretrained
vision-language model (e.g., CLIP) eventually composes a zero-shot solution to
the original language tasks. Notably, fueling language models with imagination
can effectively leverage visual knowledge to solve plain language tasks. In
consequence, Z-LaVI consistently improves the zero-shot performance of existing
language models across a diverse set of language tasks.
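As a rough, hypothetical sketch of the pipeline the abstract describes, the snippet below stubs out the "imagination" step (image retrieval or text-to-image synthesis), scores candidate answers against the imagined images with CLIP, and ensembles the result with a language model's answer distribution. The checkpoint name, the equal 0.5 weighting, and the function names are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the Z-LaVI-style ensembling idea, under assumed interfaces.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def imagine(question: str, k: int = 4) -> list[Image.Image]:
    """Placeholder for the two 'imagination' routes: retrieving existing images
    or synthesizing new ones with a text-to-image model. Returns k images."""
    raise NotImplementedError  # stubbed out in this sketch


def clip_answer_probs(images: list[Image.Image], answers: list[str]) -> torch.Tensor:
    """Score each candidate answer against the imagined images with CLIP and
    average over images to get a distribution over the answers."""
    inputs = processor(text=answers, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # logits_per_text has shape (num_answers, num_images)
    return out.logits_per_text.mean(dim=1).softmax(dim=-1)


def zero_shot_answer(question: str, answers: list[str],
                     lm_probs: torch.Tensor, w: float = 0.5) -> str:
    """Ensemble the language model's answer distribution with CLIP's."""
    vision_probs = clip_answer_probs(imagine(question), answers)
    return answers[int((w * lm_probs + (1 - w) * vision_probs).argmax())]
```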
Related papers
- Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling [47.7950860342515]
LexiContrastive Grounding (LCG) is a grounded language learning procedure that leverages visual supervision to improve textual representations.
LCG outperforms standard language-only models in learning efficiency.
It improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.
arXiv Detail & Related papers (2024-03-21T16:52:01Z)
- Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z)
- VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS), which combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
arXiv Detail & Related papers (2023-10-15T07:58:52Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens carry high-level, word-like semantics and support sequence lengths that vary dynamically with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)