Generating image captions with external encyclopedic knowledge
- URL: http://arxiv.org/abs/2210.04806v1
- Date: Mon, 10 Oct 2022 16:09:21 GMT
- Title: Generating image captions with external encyclopedic knowledge
- Authors: Sofia Nikiforova, Tejaswini Deoskar, Denis Paperno, Yoad Winter
- Abstract summary: We create an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data.
Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base.
Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions.
- Score: 1.452875650827562
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurately reporting what objects are depicted in an image is largely a
solved problem in automatic caption generation. The next big challenge on the
way to truly humanlike captioning is being able to incorporate the context of
the image and related real world knowledge. We tackle this challenge by
creating an end-to-end caption generation system that makes extensive use of
image-specific encyclopedic data. Our approach includes a novel way of using
image location to identify relevant open-domain facts in an external knowledge
base, with their subsequent integration into the captioning pipeline at both
the encoding and decoding stages. Our system is trained and tested on a new
dataset with naturally produced knowledge-rich captions, and achieves
significant improvements over multiple baselines. We empirically demonstrate
that our approach is effective for generating contextualized captions with
encyclopedic knowledge that is both factually accurate and relevant to the
image.
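The paper's central idea is to use an image's geolocation to select relevant facts from an external knowledge base. A minimal sketch of such location-based fact retrieval is below; the haversine filtering, the toy knowledge base, and the entity names are illustrative assumptions, not the authors' actual pipeline:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def retrieve_facts(image_lat, image_lon, knowledge_base, radius_km=1.0):
    """Return facts about entities within radius_km of the image location,
    nearest first. Each KB entry is (entity, lat, lon, fact)."""
    hits = []
    for entity, lat, lon, fact in knowledge_base:
        d = haversine_km(image_lat, image_lon, lat, lon)
        if d <= radius_km:
            hits.append((d, entity, fact))
    hits.sort()
    return [(entity, fact) for _, entity, fact in hits]

# Toy knowledge base: (entity, latitude, longitude, fact)
kb = [
    ("Eiffel Tower", 48.8584, 2.2945, "wrought-iron tower completed in 1889"),
    ("Louvre", 48.8606, 2.3376, "world's largest art museum"),
    ("Big Ben", 51.5007, -0.1246, "clock tower at the Palace of Westminster"),
]

# An image geotagged next to the Eiffel Tower retrieves only nearby facts.
facts = retrieve_facts(48.8580, 2.2950, kb, radius_km=2.0)
```

In the paper itself, retrieved facts are then integrated into the captioning model at both the encoding and decoding stages; the sketch only covers the retrieval step.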
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Video Summarization: Towards Entity-Aware Captions [75.71891605682931]
We propose the task of directly summarizing news videos into entity-aware captions.
We show that our approach generalizes to existing news image captions dataset.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task on the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters.
Current works still struggle with output images' quality and consistency, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Towards Accurate Text-based Image Captioning with Content Diversity Exploration [46.061291298616354]
Text-based image captioning (TextCap), which aims to read and reason about the text in images, is crucial for a machine to understand a detailed and complex scene environment.
Existing methods attempt to extend traditional image captioning methods, which describe the overall scene with a single global caption, to solve this task.
This is infeasible because the complex textual and visual information cannot be described well within one caption.
arXiv Detail & Related papers (2021-04-23T08:57:47Z)
- Integrating Image Captioning with Rule-based Entity Masking [23.79124007406315]
We propose a novel framework for image captioning with an explicit object (e.g., knowledge graph entity) selection process.
The model first explicitly selects which local entities to include in the caption according to a human-interpretable mask, then generates proper captions by attending to the selected entities.
arXiv Detail & Related papers (2020-07-22T21:27:12Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
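Two of the entries above ("Towards Retrieval-Augmented Architectures for Image Captioning" and "Retrieval-Augmented Transformer for Image Captioning") describe attending over entries retrieved from an external kNN memory. A minimal, purely illustrative sketch of kNN-augmented attention follows; the dot-product similarity, the shared key/value vectors, and all example numbers are simplifying assumptions, not those papers' actual architectures:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def knn_augmented_attention(query, context, memory, k=2):
    """Attend over the local context vectors plus the k external-memory
    vectors most similar to the query (similarity = dot product)."""
    # Retrieve the k nearest external-memory entries for this query.
    ranked = sorted(memory, key=lambda m: dot(query, m), reverse=True)
    retrieved = ranked[:k]
    # Standard attention over the union of context and retrieved entries
    # (keys double as values here, for brevity).
    keys = context + retrieved
    weights = softmax([dot(query, key) for key in keys])
    dim = len(query)
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]

# A decoder query attends over one context vector plus its 2 nearest
# memory entries; the output is a weighted average of those vectors.
vec = knn_augmented_attention(
    query=[1.0, 0.0],
    context=[[0.5, 0.5]],
    memory=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
    k=2,
)
```

The retrieved memory rows simply extend the set of keys the decoder attends over, which is the common thread between the two retrieval-augmented papers listed here.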
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.