Group-based Distinctive Image Captioning with Memory Attention
- URL: http://arxiv.org/abs/2108.09151v1
- Date: Fri, 20 Aug 2021 12:46:36 GMT
- Title: Group-based Distinctive Image Captioning with Memory Attention
- Authors: Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
- Abstract summary: Group-based Distinctive Captioning Model (GdisCap) improves the distinctiveness of image captions.
A new evaluation metric, distinctive word rate (DisWordRate), is proposed to measure the distinctiveness of captions.
- Score: 45.763534774116856
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Describing images using natural language is widely known as image captioning,
which has made consistent progress due to the development of computer vision
and natural language generation techniques. Though conventional captioning
models achieve high accuracy based on popular metrics, i.e., BLEU, CIDEr, and
SPICE, the ability of captions to distinguish the target image from other
similar images is under-explored. To generate distinctive captions, a few
pioneering works employ contrastive learning or re-weight the ground-truth
captions, focusing on a single input image. However, the relationships between
objects in a similar image group (e.g., items or properties within the same
album or fine-grained events) are neglected. In this paper, we improve the
distinctiveness of image captions using a Group-based Distinctive Captioning
Model (GdisCap), which compares each image with other images in one similar
group and highlights the uniqueness of each image. In particular, we propose a
group-based memory attention (GMA) module, which stores object features that
are unique among the image group (i.e., with low similarity to objects in other
images). These unique object features are highlighted when generating captions,
resulting in more distinctive captions. Furthermore, the distinctive words in
the ground-truth captions are selected to supervise the language decoder and
GMA. Finally, we propose a new evaluation metric, distinctive word rate
(DisWordRate) to measure the distinctiveness of captions. Quantitative results
indicate that the proposed method significantly improves the distinctiveness of
several baseline models, and achieves the state-of-the-art performance on both
accuracy and distinctiveness. Results of a user study agree with the
quantitative evaluation and demonstrate the rationality of the new metric
DisWordRate.
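The abstract describes two computable pieces: re-weighting region features by how unique they are within a similar-image group (the GMA intuition), and scoring a caption by its rate of distinctive words (DisWordRate). The snippet below is a minimal, hedged PyTorch sketch of both; the cosine-similarity measure, the max-over-other-images aggregation, the softmax re-weighting, and the word-overlap reading of DisWordRate are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def group_uniqueness_weights(target_feats, other_feats):
    """Sketch of the GMA intuition: object features with low similarity to
    objects in the other group images are treated as unique and up-weighted.

    target_feats: (N, D) region/object features of the target image.
    other_feats:  (M, D) object features pooled from the other images in the group.
    Returns re-weighted features (N, D) and per-object uniqueness scores (N,).
    """
    t = F.normalize(target_feats, dim=-1)
    o = F.normalize(other_feats, dim=-1)
    sim = t @ o.t()                       # (N, M) cosine similarities
    max_sim = sim.max(dim=-1).values      # closest match among the other images
    uniqueness = 1.0 - max_sim            # low similarity -> high uniqueness
    weights = torch.softmax(uniqueness, dim=0)
    return weights.unsqueeze(-1) * target_feats, uniqueness

def dis_word_rate(caption, target_refs, group_refs):
    """One plausible reading of DisWordRate: the fraction of caption words that
    appear in the target image's reference captions but not in the references
    of the other images in the similar group."""
    target_vocab = {w for ref in target_refs for w in ref.lower().split()}
    group_vocab = {w for ref in group_refs for w in ref.lower().split()}
    words = caption.lower().split()
    distinctive = [w for w in words if w in target_vocab and w not in group_vocab]
    return len(distinctive) / max(len(words), 1)
```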
Related papers
- Evaluating authenticity and quality of image captions via sentiment and semantic analyses [0.0]
Deep learning relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision.
In image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions.
This study proposes an evaluation method focused on sentiment and semantic richness.
arXiv Detail & Related papers (2024-09-14T23:50:23Z) - Improving Generalization of Image Captioning with Unsupervised Prompt
Learning [63.26197177542422]
Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data.
GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
arXiv Detail & Related papers (2023-08-05T12:27:01Z) - Distinctive Image Captioning via CLIP Guided Group Optimization [13.102953452346297]
In this paper, we focus on generating the distinctive captions that can distinguish the target image from other similar images.
We introduce a series of metrics that use large-scale vision-language pre-training model CLIP to quantify the distinctiveness.
We propose a simple and effective training strategy that trains the model by comparing the target image with a similar image group and optimizing the group embedding gap; a hedged sketch of this group-gap objective appears after the list below.
arXiv Detail & Related papers (2022-08-08T16:37:01Z) - Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each image region using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z) - On Distinctive Image Captioning via Comparing and Reweighting [52.3731631461383]
In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images.
Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent based on distinctiveness.
In contrast, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions.
arXiv Detail & Related papers (2022-04-08T08:59:23Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions that contain semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Compare and Reweight: Distinctive Image Captioning Using Similar Images
Sets [52.3731631461383]
We aim to improve the distinctiveness of image captions through training with sets of similar images.
Our metric shows that the human annotations of each image are not equivalent based on distinctiveness.
arXiv Detail & Related papers (2020-07-14T07:40:39Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.