LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented
Language Model Prompting
- URL: http://arxiv.org/abs/2305.19821v1
- Date: Wed, 31 May 2023 13:03:17 GMT
- Title: LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented
Language Model Prompting
- Authors: Rita Ramos, Bruno Martins, Desmond Elliott
- Abstract summary: We propose LMCap, an image-blind few-shot multilingual captioning model that works by prompting a language model with retrieved captions.
Experiments on the XM3600 dataset of geographically diverse images show that our model is competitive with fully-supervised multilingual captioning models.
- Score: 15.266569206458648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual image captioning has recently been tackled by training with
large-scale machine translated data, which is an expensive, noisy, and
time-consuming process. Without requiring any multilingual caption data, we
propose LMCap, an image-blind few-shot multilingual captioning model that works
by prompting a language model with retrieved captions. Specifically, instead of
following the standard encoder-decoder paradigm, given an image, LMCap first
retrieves the captions of similar images using a multilingual CLIP encoder.
These captions are then combined into a prompt for an XGLM decoder, in order to
generate captions in the desired language. In other words, the generation model
never processes the image directly; it operates only on the retrieved captions.
Experiments on the XM3600 dataset of geographically diverse images show that
our model is competitive with fully-supervised multilingual captioning models,
without requiring any supervised training on any captioning data.
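To make the pipeline concrete, the following is a minimal sketch of the retrieve-then-prompt idea, not the authors' exact implementation: an image is embedded with a CLIP-style encoder, the most similar captions in a small datastore are retrieved, and a multilingual decoder is prompted to continue in the target language. The specific checkpoints (sentence-transformers CLIP models, facebook/xglm-564M), the toy datastore, the query image path, and the prompt wording are all illustrative assumptions.

```python
# Minimal sketch of retrieval-augmented, image-blind captioning (assumptions noted above).
from PIL import Image
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) CLIP-style encoders: an image tower and an aligned multilingual text tower.
img_encoder = SentenceTransformer("clip-ViT-B-32")
txt_encoder = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

# Toy datastore of captions standing in for a large retrieval corpus.
datastore = [
    "A dog runs across a grassy field.",
    "Two children play football in a park.",
    "A brown dog catches a frisbee mid-air.",
]
caption_embs = txt_encoder.encode(datastore, convert_to_tensor=True, normalize_embeddings=True)

# 2) Embed the query image and retrieve the k most similar captions.
image = Image.open("query.jpg")  # hypothetical input image
image_emb = img_encoder.encode(image, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(image_emb, caption_embs, top_k=2)[0]
retrieved = [datastore[h["corpus_id"]] for h in hits]

# 3) Prompt a multilingual decoder; only the retrieved text reaches the model.
prompt = (
    "Similar images are described as: " + " ".join(retrieved)
    + " A caption for this image in German is:"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
decoder = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M")
inputs = tokenizer(prompt, return_tensors="pt")
output = decoder.generate(**inputs, max_new_tokens=30, num_beams=3)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In the paper's terms the generation step is image-blind: only retrieved caption text reaches the decoder, so producing captions in another language amounts to changing the prompt rather than retraining.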
Related papers
- Towards Automatic Satellite Images Captions Generation Using Large
Language Models [0.5439020425819]
We propose Automatic Remote Sensing Image Captioning (ARSIC) to automatically collect captions for remote sensing images.
We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote-sensing images.
arXiv Detail & Related papers (2023-10-17T16:45:47Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
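As a rough illustration of the longer-context prompting idea in the HowToCaption entry above, the sketch below assembles a window of consecutive subtitles into a single prompt so the LLM sees context beyond one sentence. The window size, data layout, and prompt wording are assumptions, and the actual LLM call is left out.

```python
# Hypothetical prompt construction over a window of timestamped subtitles.
from typing import List, Tuple

Subtitle = Tuple[float, float, str]  # (start_sec, end_sec, text)

def build_captioning_prompt(subs: List[Subtitle], center: int, window: int = 4) -> str:
    """Collect several consecutive subtitles around `center` so the LLM sees
    context beyond a single sentence, then ask it to write visual captions."""
    lo, hi = max(0, center - window), min(len(subs), center + window + 1)
    block = "\n".join(f"[{s:.0f}s-{e:.0f}s] {t}" for s, e, t in subs[lo:hi])
    return (
        "The following are consecutive subtitles from an instructional video:\n"
        f"{block}\n"
        "Rewrite them as short captions describing what is visible on screen, "
        "one caption per subtitle, keeping the timestamps."
    )

# Toy subtitles standing in for HowTo100M-style transcripts.
subs = [(0, 3, "so first we peel the garlic"),
        (3, 7, "and chop it finely like this"),
        (7, 11, "then it goes into the hot pan with olive oil")]
print(build_captioning_prompt(subs, center=1))
```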
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
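The projection step described in the DeCap entry above can be sketched as a similarity-weighted mixture over a support memory of CLIP text embeddings, which yields an embedding that lives in the text space while still reflecting the image. The memory contents, the temperature, and the text-only decoder that would consume this embedding are assumptions here.

```python
# Sketch: project a CLIP image embedding into the CLIP text embedding space.
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb: torch.Tensor,
                          memory_text_embs: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """image_emb: (d,) CLIP image embedding; memory_text_embs: (n, d) CLIP text
    embeddings of a support set of captions. Returns a (d,) embedding in the
    text space whose content is weighted toward captions similar to the image."""
    image_emb = F.normalize(image_emb, dim=-1)
    memory = F.normalize(memory_text_embs, dim=-1)
    weights = F.softmax(memory @ image_emb / temperature, dim=-1)  # (n,) attention over memory
    projected = weights @ memory                                   # (d,) weighted mixture
    return F.normalize(projected, dim=-1)

# Toy example with random embeddings standing in for real CLIP outputs.
img = torch.randn(512)
mem = torch.randn(1000, 512)
print(project_to_text_space(img, mem).shape)  # torch.Size([512])
```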
- Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning [93.6842670770983]
Vid2Seq is a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks.
arXiv Detail & Related papers (2023-02-27T19:53:49Z)
- Retrieval-augmented Image Captioning [15.266569206458648]
We present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore.
The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT.
Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
arXiv Detail & Related papers (2023-02-16T12:54:13Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Fusion Models for Improved Visual Captioning [18.016295296424413]
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.
arXiv Detail & Related papers (2020-10-28T21:55:25Z)
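One simple way to realize the fusion idea in the last entry is to interpolate the next-token distributions of a visual captioning model and a language model. The paper studies several fusion and emendation strategies, so treat this log-linear mixture, its weight, and the toy vocabulary purely as an illustrative sketch rather than the method itself.

```python
# Toy log-linear fusion of two next-token distributions over a shared vocabulary.
import numpy as np

def fuse_next_token_probs(p_caption: np.ndarray, p_lm: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """alpha weights the visual captioning model against the language model."""
    log_p = alpha * np.log(p_caption + 1e-12) + (1 - alpha) * np.log(p_lm + 1e-12)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

# Toy distributions over a 5-word vocabulary.
p_vis = np.array([0.10, 0.60, 0.10, 0.10, 0.10])  # from the captioning model
p_lm  = np.array([0.05, 0.30, 0.50, 0.10, 0.05])  # from the (masked) language model
print(fuse_next_token_probs(p_vis, p_lm))
```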
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.