Generating image captions with external encyclopedic knowledge
- URL: http://arxiv.org/abs/2210.04806v1
- Date: Mon, 10 Oct 2022 16:09:21 GMT
- Title: Generating image captions with external encyclopedic knowledge
- Authors: Sofia Nikiforova, Tejaswini Deoskar, Denis Paperno, Yoad Winter
- Abstract summary: We create an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data.
Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base.
Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions.
- Score: 1.452875650827562
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurately reporting what objects are depicted in an image is largely a
solved problem in automatic caption generation. The next big challenge on the
way to truly humanlike captioning is being able to incorporate the context of
the image and related real-world knowledge. We tackle this challenge by
creating an end-to-end caption generation system that makes extensive use of
image-specific encyclopedic data. Our approach includes a novel way of using
image location to identify relevant open-domain facts in an external knowledge
base, with their subsequent integration into the captioning pipeline at both
the encoding and decoding stages. Our system is trained and tested on a new
dataset with naturally produced knowledge-rich captions, and achieves
significant improvements over multiple baselines. We empirically demonstrate
that our approach is effective for generating contextualized captions with
encyclopedic knowledge that is both factually accurate and relevant to the
image.
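To make the location-based retrieval step concrete, below is a minimal sketch. The toy in-memory knowledge base, its field names, and the 2 km radius are illustrative assumptions, not the authors' pipeline, which integrates the retrieved facts into both the encoding and decoding stages of a full captioning model.

```python
# Minimal sketch (NOT the authors' implementation): retrieve encyclopedic
# facts about entities near an image's geographic location. The toy
# knowledge base, field names, and 2 km radius are illustrative assumptions.
from math import radians, sin, cos, asin, sqrt

KNOWLEDGE_BASE = [
    {"entity": "Eiffel Tower", "fact": "completed in 1889", "lat": 48.8584, "lon": 2.2945},
    {"entity": "Louvre", "fact": "the world's most-visited museum", "lat": 48.8606, "lon": 2.3376},
    {"entity": "Tower Bridge", "fact": "opened in 1894", "lat": 51.5055, "lon": -0.0754},
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def retrieve_facts(image_lat, image_lon, radius_km=2.0):
    """Return facts about entities within radius_km of the image, closest
    first; these would then condition the caption encoder and decoder."""
    scored = [(haversine_km(image_lat, image_lon, f["lat"], f["lon"]), f)
              for f in KNOWLEDGE_BASE]
    return [f for d, f in sorted(scored, key=lambda x: x[0]) if d <= radius_km]

# Example: a photo taken near the Eiffel Tower retrieves Paris facts only.
for fact in retrieve_facts(48.858, 2.294):
    print(f"{fact['entity']}: {fact['fact']}")
```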
Related papers
- Altogether: Image Captioning via Re-aligning Alt-text [118.29542883805405]
We study Altogether, a principled approach based on the key idea of editing and re-aligning the existing alt-texts associated with images.
To generate training data, we perform human annotation where annotators start with the existing alt-text and re-align it to the image content in multiple rounds.
We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale.
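As a rough illustration of the inference-time behavior this training yields, here is a hedged sketch; the captioner(image, text) interface and the fixed number of rounds are assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of multi-round alt-text re-alignment; the
# captioner(image, text) interface and the round count are assumptions.
def realign_alt_text(captioner, image, alt_text, rounds=3):
    """Start from the existing alt-text and repeatedly re-align it to the
    image content, mirroring the multi-round annotation process."""
    caption = alt_text
    for _ in range(rounds):
        caption = captioner(image, caption)
    return caption

def demo_captioner(image, text):
    """Stand-in model for illustration only."""
    return text if image in text else f"{text} of {image}"

print(realign_alt_text(demo_captioner, "a red bridge", "a photo", rounds=2))
# -> "a photo of a red bridge"
```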
arXiv Detail & Related papers (2024-10-22T17:59:57Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
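A minimal sketch of what such a visual-similarity retriever might look like, assuming cosine similarity over precomputed image features; the memory layout, feature dimension, and placeholder data are illustrative, not the paper's exact architecture.

```python
# Sketch of a kNN memory over visual features (illustrative assumptions:
# cosine similarity, 512-d features, random placeholder data).
import numpy as np

rng = np.random.default_rng(0)
memory_feats = rng.normal(size=(1000, 512))          # training-image features
memory_caps = [f"caption {i}" for i in range(1000)]  # their captions

def retrieve_captions(query_feat, k=5):
    """Return captions of the k most visually similar memory images; these
    would be fed to the captioning model to guide generation."""
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q                                     # cosine similarities
    top_k = np.argsort(-sims)[:k]
    return [memory_caps[i] for i in top_k]

print(retrieve_captions(rng.normal(size=512)))
```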
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: informational sufficiency, minimal redundancy, and ready comprehensibility for human readers.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
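A toy sketch of the local/global integration idea follows; the captioner and merge interfaces are assumptions, standing in for the vision-language and aggregation models PoCa would actually use.

```python
# Toy sketch of pyramid-style captioning: caption the full image and each
# local patch, then fuse the local detail into the global caption. The
# captioner/merge interfaces are assumptions, not PoCa's actual components.
def pyramid_caption(captioner, merge, image, patches):
    global_cap = captioner(image)
    local_caps = [captioner(patch) for patch in patches]
    return merge(global_cap, local_caps)

def demo_captioner(img):
    return f"a scene with {img}"

def demo_merge(global_cap, local_caps):
    return global_cap + "; details: " + ", ".join(local_caps)

print(pyramid_caption(demo_captioner, demo_merge, "a market",
                      ["a fruit stall", "a dog"]))
```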
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task, as measured by the CIDEr metric.
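As a rough illustration of captioning from text alone, here is a hedged sketch; the prompt template is an assumption, and `llm` stands in for any text-completion model rather than the authors' specific setup.

```python
# Sketch of caption generation from description and context alone; the
# prompt template is an assumption, and `llm` is any text-completion model.
def caption_from_text(llm, description, context):
    prompt = (
        "Write a one-sentence image caption.\n"
        f"Image description: {description}\n"
        f"Surrounding context: {context}\n"
        "Caption:"
    )
    return llm(prompt)

def fake_llm(prompt):
    """Stand-in model for illustration only."""
    return "Crowds gather at the harbor during the annual festival."

print(caption_from_text(fake_llm, "people near boats", "annual harbor festival"))
```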
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters.
Current approaches still struggle with the quality and consistency of the output images, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
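One plausible (assumed) reading of such a cross-sentence, word-level representation is sketched below: each sentence is attention-pooled over the word embeddings of the entire story, so recurring entities are encoded consistently across frames.

```python
# Sketch of a cross-sentence, word-level sentence representation; the
# mean-query attention pooling below is an assumption about the mechanism.
import numpy as np

def story_sentence_reps(word_embs_per_sentence):
    """word_embs_per_sentence: list of (n_words_i, d) arrays, one per
    sentence. Each sentence representation attends over ALL story words."""
    all_words = np.concatenate(word_embs_per_sentence, axis=0)  # (N, d)
    d = all_words.shape[1]
    reps = []
    for sent in word_embs_per_sentence:
        query = sent.mean(axis=0)                     # (d,)
        scores = all_words @ query / np.sqrt(d)       # (N,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                      # softmax attention
        reps.append(weights @ all_words)              # (d,)
    return reps

rng = np.random.default_rng(1)
story = [rng.normal(size=(5, 16)), rng.normal(size=(7, 16))]
print([rep.shape for rep in story_sentence_reps(story)])  # [(16,), (16,)]
```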
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
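The kNN-augmented prediction can be pictured as interpolating the decoder's distribution with a distribution over retrieved tokens, in the spirit of kNN-LM; this analogy, the uniform weighting of retrieved tokens, and the lam value are assumptions, not the paper's exact attention layer.

```python
# Sketch of kNN-augmented token prediction in the spirit of kNN-LM; the
# interpolation scheme and lam=0.3 are assumptions, not the exact layer.
import numpy as np

def knn_augmented_probs(model_probs, retrieved_token_ids, vocab_size, lam=0.3):
    """Mix the decoder's next-token distribution with the empirical
    distribution of tokens retrieved from the external corpus."""
    knn_probs = np.zeros(vocab_size)
    for token_id in retrieved_token_ids:
        knn_probs[token_id] += 1.0
    knn_probs /= max(knn_probs.sum(), 1.0)
    return (1.0 - lam) * model_probs + lam * knn_probs

vocab_size = 10
uniform = np.full(vocab_size, 1.0 / vocab_size)
print(knn_augmented_probs(uniform, [3, 3, 7], vocab_size))
```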
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
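A minimal sketch of the task interface, assuming a location-conditioned decoder; the decode(feats, loc) signature is hypothetical and only illustrates producing one caption per target location.

```python
# Sketch of dense, location-conditioned captioning; decode(feats, loc) is a
# hypothetical interface, not CapOnImage's actual model.
def captions_on_image(decode, image_feats, locations):
    """Generate one caption per requested image location."""
    return {loc: decode(image_feats, loc) for loc in locations}

def demo_decode(feats, loc):
    return f"caption for region centered at {loc}"

print(captions_on_image(demo_decode, "image-features", [(0.2, 0.3), (0.7, 0.8)]))
```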
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Towards Accurate Text-based Image Captioning with Content Diversity Exploration [46.061291298616354]
Text-based image captioning (TextCap), which aims to read and reason about text in images, is crucial for a machine to understand detailed and complex scene environments.
Existing methods extend traditional image captioning approaches, which describe the overall scene with a single global caption, to this task.
This is infeasible because the complex textual and visual information cannot be captured well within one caption.
arXiv Detail & Related papers (2021-04-23T08:57:47Z)
- Integrating Image Captioning with Rule-based Entity Masking [23.79124007406315]
We propose a novel framework for image captioning with an explicit object (e.g., knowledge graph entity) selection process.
The model first explicitly selects which local entities to include in the caption according to a human-interpretable mask, then generates a proper caption by attending to the selected entities.
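The selection step can be pictured as a binary mask over detected entities, as in the minimal sketch below; the mask format is an assumption.

```python
# Minimal sketch of explicit, mask-based entity selection; the binary mask
# format is an assumption. The decoder would attend only to the kept subset.
def select_entities(entities, mask):
    return [entity for entity, keep in zip(entities, mask) if keep]

entities = ["Eiffel Tower", "tourist", "pigeon"]
print(select_entities(entities, [1, 1, 0]))  # ['Eiffel Tower', 'tourist']
```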
arXiv Detail & Related papers (2020-07-22T21:27:12Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
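The joint objective can be sketched as a weighted sum of the word-prediction and tag-prediction losses; the cross-entropy form and the alpha weight below are assumptions for illustration.

```python
# Sketch of the multi-task objective: next-word loss plus a weighted loss
# over object/predicate tag sequences. The alpha weight is an assumption.
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of one softmax prediction against an integer target."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

def multitask_loss(word_logits, word_targets, tag_logits, tag_targets, alpha=0.5):
    """L = L_word + alpha * L_tag, averaged over sequence positions."""
    l_word = np.mean([cross_entropy(l, t) for l, t in zip(word_logits, word_targets)])
    l_tag = np.mean([cross_entropy(l, t) for l, t in zip(tag_logits, tag_targets)])
    return l_word + alpha * l_tag

rng = np.random.default_rng(2)
word_logits = [rng.normal(size=8) for _ in range(4)]  # 4 steps, vocab of 8
tag_logits = [rng.normal(size=5) for _ in range(4)]   # 4 steps, 5 tag types
print(multitask_loss(word_logits, [0, 1, 2, 3], tag_logits, [0, 1, 0, 2]))
```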
arXiv Detail & Related papers (2020-06-21T14:10:47Z)