Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot
Image Captioning
- URL: http://arxiv.org/abs/2302.04858v2
- Date: Sun, 22 Oct 2023 04:18:00 GMT
- Title: Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot
Image Captioning
- Authors: Zhuolin Yang, Wei Ping, Zihan Liu, Vijay Korthikanti, Weili Nie, De-An
Huang, Linxi Fan, Zhiding Yu, Shiyi Lan, Bo Li, Ming-Yu Liu, Yuke Zhu,
Mohammad Shoeybi, Bryan Catanzaro, Chaowei Xiao, Anima Anandkumar
- Abstract summary: We introduce Re-ViLM, a Retrieval-augmented Visual Language Model built upon Flamingo.
By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters.
We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks.
- Score: 153.98100182439165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Augmenting pretrained language models (LMs) with a vision encoder (e.g.,
Flamingo) has achieved state-of-the-art results in image-to-text generation.
However, these models store all knowledge within their parameters, and thus
often require an enormous number of parameters to model the abundant visual
concepts and rich textual descriptions. Additionally, they are inefficient at
incorporating new data, requiring a computationally expensive fine-tuning
process. In this work, we introduce Re-ViLM, a Retrieval-augmented Visual
Language Model built upon Flamingo, which supports retrieving relevant
knowledge from an external database for zero-shot and in-context few-shot
image-to-text generation. By storing certain knowledge explicitly in the
external database, our approach reduces the number of model parameters and can
easily accommodate new data during evaluation by simply updating the database.
We also construct an interleaved image-and-text dataset that facilitates
in-context few-shot learning. We demonstrate that Re-ViLM significantly boosts
performance on image-to-text generation tasks, especially for zero-shot and
few-shot generation in out-of-domain settings, with 4 times fewer parameters
than baseline methods.
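The retrieval step described in the abstract (look up relevant image-text pairs in an external database and condition generation on them, so that new data can be added by updating the database alone) can be illustrated with a minimal sketch. The function names, the cosine-similarity retriever, and the plain-text prompt below are illustrative assumptions, not the paper's implementation; Re-ViLM conditions a Flamingo-style decoder on the retrieved items rather than on a text prompt.

```python
# Minimal sketch of retrieval-augmented captioning, assuming precomputed image
# embeddings (e.g., from a CLIP-style encoder) for the query image and for an
# external datastore of image-caption pairs. Not the authors' implementation.
import numpy as np

def retrieve_captions(query_emb: np.ndarray,
                      datastore_embs: np.ndarray,
                      datastore_captions: list[str],
                      k: int = 4) -> list[str]:
    """Return captions of the k datastore images most similar to the query."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    d = datastore_embs / np.linalg.norm(datastore_embs, axis=1, keepdims=True)
    scores = d @ q                     # (N,) cosine similarities
    top_k = np.argsort(-scores)[:k]    # indices of the k nearest neighbors
    return [datastore_captions[i] for i in top_k]

def build_prompt(retrieved: list[str]) -> str:
    """Prepend retrieved captions as in-context evidence for the decoder
    (a stand-in for feeding them to a multimodal decoder)."""
    context = " ".join(f"Similar image caption: {c}." for c in retrieved)
    return f"{context} Caption of the query image:"
```

Because the datastore is external, accommodating new data only requires appending new embeddings and captions to it, with no fine-tuning, which is the property the abstract highlights for evaluation-time updates.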
Related papers
- Improving the Efficiency of Visually Augmented Language Models [5.948051066733892]
This paper shows that explicit images are not necessary to visually augment an LM.
Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system.
We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks.
arXiv Detail & Related papers (2024-09-17T13:02:19Z) - ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements.
We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z) - COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled at few-shot text generation but struggle with alignment tasks.
We introduce a contrastive loss into text generation models, partitioning the language model into components dedicated to unimodal text processing and multimodal data handling.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z) - EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension [24.335348817838216]
Image captioning based on large language models (LLMs) can describe objects not explicitly observed in the training data.
We introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from an external visual-name memory (EVCap).
Our model, which was trained only on the COCO dataset, can adapt to out-of-domain without requiring additional fine-tuning or re-training.
arXiv Detail & Related papers (2023-11-27T14:51:37Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - Generative Negative Text Replay for Continual Vision-Language
Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z) - Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.