Boost Image Captioning with Knowledge Reasoning
- URL: http://arxiv.org/abs/2011.00927v1
- Date: Mon, 2 Nov 2020 12:19:46 GMT
- Title: Boost Image Captioning with Knowledge Reasoning
- Authors: Feicheng Huang, Zhixin Li, Haiyang Wei, Canlong Zhang, Huifang Ma
- Abstract summary: We propose word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word.
We introduce a new strategy to inject external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning.
- Score: 10.733743535624509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically generating a human-like description for a given image is an
important research topic in artificial intelligence that has attracted a great deal of
attention recently. Most existing attention methods explore the mapping relationships
between words in the sentence and regions in the image; such an unpredictable matching
manner sometimes causes inharmonious alignments that may reduce the quality of the
generated captions. In this paper, we strive to reason about more accurate and
meaningful captions. We first propose word attention to improve the correctness of
visual attention when generating sequential descriptions word by word. This word
attention emphasizes word importance when focusing on different regions of the input
image, and makes full use of the internal annotation knowledge to assist the
computation of visual attention. Then, in order to reveal intentions that cannot be
expressed straightforwardly by machines, we introduce a new strategy to inject external
knowledge extracted from a knowledge graph into the encoder-decoder framework to
facilitate meaningful captioning. Finally, we validate our model on two freely
available captioning benchmarks: the Microsoft COCO dataset and the Flickr30k dataset.
The results demonstrate that our approach achieves state-of-the-art performance and
outperforms many existing approaches.
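The abstract describes two mechanisms, word attention guiding visual attention and knowledge-graph injection, without reference code. Below is a minimal PyTorch sketch of the first mechanism only, assuming an attention distribution over the caption words generated so far modulates visual attention over region features; the module and tensor names (WordGuidedVisualAttention, region_feats, word_embeds, hidden) are illustrative assumptions, not the authors' implementation.

```python
# Minimal, illustrative sketch of word attention guiding visual attention.
# Dimensions and module structure are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGuidedVisualAttention(nn.Module):
    def __init__(self, region_dim, word_dim, hidden_dim, attn_dim):
        super().__init__()
        # word attention: scores the caption words generated so far
        self.word_proj = nn.Linear(word_dim, attn_dim)
        self.word_state = nn.Linear(hidden_dim, attn_dim)
        self.word_score = nn.Linear(attn_dim, 1)
        # visual attention: scores image regions, conditioned on the word context
        self.region_proj = nn.Linear(region_dim, attn_dim)
        self.vis_state = nn.Linear(hidden_dim + word_dim, attn_dim)
        self.vis_score = nn.Linear(attn_dim, 1)

    def forward(self, region_feats, word_embeds, hidden):
        # region_feats: (B, R, region_dim)  image region features
        # word_embeds:  (B, T, word_dim)    embeddings of previously generated words
        # hidden:       (B, hidden_dim)     current decoder state
        w = torch.tanh(self.word_proj(word_embeds) + self.word_state(hidden).unsqueeze(1))
        word_alpha = F.softmax(self.word_score(w).squeeze(-1), dim=1)       # (B, T)
        word_ctx = (word_alpha.unsqueeze(-1) * word_embeds).sum(dim=1)      # (B, word_dim)

        # the attended word context guides where the visual attention looks
        guide = torch.cat([hidden, word_ctx], dim=-1)
        v = torch.tanh(self.region_proj(region_feats) + self.vis_state(guide).unsqueeze(1))
        vis_alpha = F.softmax(self.vis_score(v).squeeze(-1), dim=1)         # (B, R)
        vis_ctx = (vis_alpha.unsqueeze(-1) * region_feats).sum(dim=1)       # (B, region_dim)
        return vis_ctx, vis_alpha, word_alpha
```

Under the same assumptions, the knowledge-injection step would retrieve entity embeddings from a knowledge graph and feed them to the decoder alongside the visual context; that part is omitted here because the abstract does not specify its form.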
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
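As a rough illustration of the kNN-augmented attention mentioned above (a sketch under assumed names, not the authors' implementation), the snippet below retrieves the k most visually similar entries from an external memory of caption-token embeddings and lets the decoder cross-attend to them.

```python
# Illustrative kNN retrieval plus memory attention for captioning.
# Memory layout and all names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def knn_retrieve(query_vis, memory_keys, memory_values, k=16):
    """query_vis: (B, D) global image features; memory_keys: (N, D) visual keys
    of an external corpus; memory_values: (N, M, E) token embeddings of the
    caption stored for each key. Returns retrieved token embeddings."""
    sims = F.normalize(query_vis, dim=-1) @ F.normalize(memory_keys, dim=-1).t()  # (B, N)
    topk = sims.topk(k, dim=-1).indices                                           # (B, k)
    retrieved = memory_values[topk]                                               # (B, k, M, E)
    return retrieved.flatten(1, 2)                                                # (B, k*M, E)

def knn_augmented_attention(decoder_states, retrieved, attn):
    """decoder_states: (B, T, E); retrieved: (B, k*M, E);
    attn: nn.MultiheadAttention(E, num_heads, batch_first=True)."""
    ctx, _ = attn(decoder_states, retrieved, retrieved)  # cross-attend to memory
    return decoder_states + ctx                          # residual fusion

# Example wiring with hypothetical dimensions:
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
img_feat = torch.randn(2, 512); keys = torch.randn(100, 512)
values = torch.randn(100, 20, 256); dec_states = torch.randn(2, 7, 256)
fused = knn_augmented_attention(dec_states, knn_retrieve(img_feat, keys, values, k=8), attn)
```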
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, which draws similar images, topics, and captions together in the shared semantic space.
The experimental results on the MSCOCO dataset show the competitiveness of our approach.
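A pairwise ranking objective of the kind mentioned here is commonly implemented as a margin-based hinge loss over the similarity matrix of matched image-caption pairs; the sketch below is a generic formulation under assumed names, not necessarily the paper's exact objective.

```python
# Generic margin-based pairwise ranking loss for a shared image-caption
# embedding space (illustrative; not the paper's exact formulation).
import torch

def pairwise_ranking_loss(img_emb, cap_emb, margin=0.2):
    """img_emb, cap_emb: (B, D) L2-normalised embeddings of matched pairs.
    Each matched pair is pushed to score higher than mismatched pairs by `margin`."""
    scores = img_emb @ cap_emb.t()                       # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)                      # matched-pair scores
    cost_cap = (margin + scores - pos).clamp(min=0)      # image vs. wrong captions
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_cap.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()
```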
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
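The gated graph convolution mentioned above can be sketched as each region aggregating its neighbours' features over a semantic adjacency matrix, with a learned gate controlling how much of the aggregated message is kept; this is an assumed, simplified form rather than the authors' implementation.

```python
# Simplified gated graph convolution over region features (illustrative only).
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)        # transforms neighbour messages
        self.gate = nn.Linear(2 * dim, dim)   # decides how much message to keep

    def forward(self, x, adj):
        # x:   (B, R, dim) region features
        # adj: (B, R, R)   row-normalised semantic adjacency between object pairs
        neighbours = adj @ self.msg(x)        # aggregate neighbours' information
        g = torch.sigmoid(self.gate(torch.cat([x, neighbours], dim=-1)))
        return x + g * neighbours             # gated residual update
```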
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Robust Image Captioning [3.20603058999901]
In this study, we leverage object relations using an adversarial robust cut algorithm.
Our experimental study demonstrates the promising performance of our proposed method for image captioning.
arXiv Detail & Related papers (2020-12-06T00:33:17Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
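The joint word and object/predicate tag prediction described above suggests a shared decoder state feeding two classification heads trained with a weighted sum of cross-entropy losses; the sketch below uses assumed names and weights and is not the paper's architecture.

```python
# Schematic multi-task heads for joint word and tag prediction (illustrative).
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_dim, word_vocab, tag_vocab):
        super().__init__()
        self.word_head = nn.Linear(hidden_dim, word_vocab)  # caption words
        self.tag_head = nn.Linear(hidden_dim, tag_vocab)    # object/predicate tags

    def forward(self, decoder_states):
        # decoder_states: (B, T, hidden_dim) shared decoder hidden states
        return self.word_head(decoder_states), self.tag_head(decoder_states)

def multitask_loss(word_logits, tag_logits, word_targets, tag_targets, tag_weight=0.5):
    # logits: (B, T, V); targets: (B, T) index tensors
    word_loss = F.cross_entropy(word_logits.flatten(0, 1), word_targets.flatten())
    tag_loss = F.cross_entropy(tag_logits.flatten(0, 1), tag_targets.flatten())
    return word_loss + tag_weight * tag_loss
```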
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
- Exploring and Distilling Cross-Modal Information for Image Captioning [47.62261144821135]
We argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest.
Based on the Transformer, we propose the Global-and-Local Information Exploring-and-Distilling approach that explores and distills the source information in vision and language.
Our Transformer-based model achieves a CIDEr score of 129.3 in offline evaluation on the COCO testing set, with remarkable efficiency in terms of accuracy, speed, and parameter budget.
arXiv Detail & Related papers (2020-02-28T07:46:48Z)