Integrating Image Captioning with Rule-based Entity Masking
- URL: http://arxiv.org/abs/2007.11690v1
- Date: Wed, 22 Jul 2020 21:27:12 GMT
- Title: Integrating Image Captioning with Rule-based Entity Masking
- Authors: Aditya Mogadala and Xiaoyu Shen and Dietrich Klakow
- Abstract summary: We propose a novel framework for image captioning with an explicit object (e.g., knowledge graph entity) selection process.
The model first explicitly selects which local entities to include in the caption according to a human-interpretable mask, then generates captions by attending to the selected entities.
- Score: 23.79124007406315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an image, generating its natural language description (i.e., caption)
is a well-studied problem. Approaches proposed to address this problem usually
rely on image features that are difficult to interpret. In particular, these
image features are subdivided into global and local features, where global
features are extracted from the global representation of the image, while local
features are extracted from the objects detected locally in an image. Although
local features capture rich visual information from the image, existing models
generate captions in a black-box manner, and humans have difficulty interpreting
which local objects a caption is meant to represent. Hence, in this paper, we
propose a novel framework for image captioning with an explicit object
(e.g., knowledge graph entity) selection process while still maintaining its
end-to-end training ability. The model first explicitly selects which local
entities to include in the caption according to a human-interpretable mask,
then generates captions by attending to the selected entities. Experiments
conducted on the MSCOCO dataset demonstrate that our method achieves good
performance in terms of caption quality and diversity, with a more
interpretable generation process than previous counterparts.
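The core mechanism described in the abstract — a human-interpretable binary mask that selects which detected entities the caption generator may attend to — can be illustrated with a small sketch. This is not the paper's actual model; it is a toy NumPy illustration of the masked-attention idea, where the function name, feature shapes, and the dot-product scoring are assumptions made for the example:

```python
import numpy as np

def masked_entity_attention(entity_feats, entity_mask, query):
    """Attend only to entities selected by a binary, human-readable mask.

    entity_feats: (n_entities, d) local features of detected objects
    entity_mask:  (n_entities,) 1 = entity may appear in the caption, 0 = excluded
    query:        (d,) decoder state querying the entity features
    Returns the attention-weighted context vector over the selected entities.
    """
    scores = entity_feats @ query                          # similarity per entity
    scores = np.where(entity_mask == 1, scores, -np.inf)   # hide masked-out entities
    # Numerically stable softmax; exp(-inf) = 0, so excluded entities get zero weight.
    weights = np.exp(scores - scores[entity_mask == 1].max())
    weights /= weights.sum()
    return weights @ entity_feats

# Toy example: three detected entities, the mask selects only the first two,
# so the third entity cannot influence the generated caption.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
mask = np.array([1, 1, 0])
query = np.array([1.0, 0.0])
context = masked_entity_attention(feats, mask, query)
```

Because the mask is an explicit 0/1 vector over detected objects, a human can read off exactly which entities the caption was allowed to describe; in the paper this selection remains differentiable so the whole pipeline trains end-to-end.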
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task on the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Language Guided Local Infiltration for Interactive Image Retrieval [12.324893780690918]
Interactive Image Retrieval (IIR) aims to retrieve images that are generally similar to the reference image but under requested text modification.
We propose a Language Guided Local Infiltration (LGLI) system, which fully utilizes the text information and penetrates text features into image features.
Our method outperforms most state-of-the-art IIR approaches.
arXiv Detail & Related papers (2023-04-16T10:33:08Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Generating image captions with external encyclopedic knowledge [1.452875650827562]
We create an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data.
Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base.
Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions.
arXiv Detail & Related papers (2022-10-10T16:09:21Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Context-Aware Image Inpainting with Learned Semantic Priors [100.99543516733341]
We introduce pretext tasks that are semantically meaningful for estimating the missing contents.
We propose a context-aware image inpainting model, which adaptively integrates global semantics and local features.
arXiv Detail & Related papers (2021-06-14T08:09:43Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.