Retrieval-augmented Image Captioning
- URL: http://arxiv.org/abs/2302.08268v1
- Date: Thu, 16 Feb 2023 12:54:13 GMT
- Title: Retrieval-augmented Image Captioning
- Authors: Rita Ramos, Desmond Elliott, Bruno Martins
- Abstract summary: We present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore.
The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT.
Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
- Score: 15.266569206458648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by retrieval-augmented language generation and pretrained Vision and
Language (V&L) encoders, we present a new approach to image captioning that
generates sentences given the input image and a set of captions retrieved from
a datastore, as opposed to the image alone. The encoder in our model jointly
processes the image and retrieved captions using a pretrained V&L BERT, while
the decoder attends to the multimodal encoder representations, benefiting from
the extra textual evidence from the retrieved captions. Experimental results on
the COCO dataset show that image captioning can be effectively formulated from
this new perspective. Our model, named EXTRA, benefits from using captions
retrieved from the training dataset, and it can also benefit from using an
external dataset without the need for retraining. Ablation studies show that
retrieving a sufficient number of captions (e.g., k=5) can improve captioning
quality. Our work contributes towards using pretrained V&L encoders for
generative tasks, instead of standard classification tasks.
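Below is a minimal sketch of the retrieval step described in the abstract, assuming a CLIP-style bi-encoder as a stand-in retriever over a small in-memory datastore; the paper's actual retriever, pretrained V&L BERT encoder, and decoder are not reproduced here, and the model checkpoint, datastore contents, and helper name are illustrative.

```python
# Hedged sketch: image-to-caption retrieval with a CLIP-style bi-encoder as a
# stand-in for the datastore lookup described in the abstract.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Datastore: captions from the training set (or an external corpus).
datastore = [
    "a man riding a wave on top of a surfboard",
    "a plate of food with broccoli and rice",
    "a group of people standing around a kitchen",
]

with torch.no_grad():
    text_inputs = processor(text=datastore, return_tensors="pt", padding=True)
    caption_embs = model.get_text_features(**text_inputs)
    caption_embs = caption_embs / caption_embs.norm(dim=-1, keepdim=True)

def retrieve_captions(image: Image.Image, k: int = 5) -> list[str]:
    """Return the k datastore captions most similar to the input image."""
    with torch.no_grad():
        image_inputs = processor(images=image, return_tensors="pt")
        img_emb = model.get_image_features(**image_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ caption_embs.T).squeeze(0)
    top = scores.topk(min(k, len(datastore))).indices.tolist()
    return [datastore[i] for i in top]
```

In the full model, the retrieved captions are encoded jointly with the image by the pretrained V&L BERT, and the decoder attends to those multimodal representations when generating the caption.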
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting [15.266569206458648]
We propose LMCap, an image-blind few-shot multilingual captioning model that works by prompting a language model with retrieved captions.
Experiments on the XM3600 dataset of geographically diverse images show that our model is competitive with fully-supervised multilingual captioning models (see the prompting sketch after this list).
arXiv Detail & Related papers (2023-05-31T13:03:17Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)
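As a companion to the LMCap entry above, the following is a minimal, hedged sketch of prompting a language model with retrieved captions; the prompt template, the GPT-2 stand-in model, and the helper name are illustrative assumptions rather than the paper's exact setup.

```python
# Hedged sketch of retrieval-augmented prompting in the spirit of LMCap:
# retrieved captions for an image are formatted into a prompt and a language
# model completes the caption, without the model ever seeing the image.
from transformers import pipeline

# Stand-in language model; LMCap itself prompts a multilingual language model.
generator = pipeline("text-generation", model="gpt2")

def caption_from_retrieved(retrieved: list[str], language: str = "English") -> str:
    """Build a prompt from retrieved captions and let the LM complete a caption."""
    context = "\n".join(f"- {c}" for c in retrieved)
    prompt = (
        "Similar images were described as follows:\n"
        f"{context}\n"
        f"A {language} caption for this image is:"
    )
    out = generator(prompt, max_new_tokens=30, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep the continuation.
    return out[0]["generated_text"][len(prompt):].strip()

print(caption_from_retrieved([
    "a man riding a wave on top of a surfboard",
    "a surfer in a wetsuit riding a small wave",
]))
```

Because the language model only sees retrieved text, the approach is image-blind: caption quality depends entirely on the retrieved captions and the prompt.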
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.