Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
- URL: http://arxiv.org/abs/2504.02496v1
- Date: Thu, 03 Apr 2025 11:19:51 GMT
- Title: Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
- Authors: Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
- Abstract summary: Group-based Differential Distinctive Captioning Method; Group-based Differential Memory Attention (GDMA) module; a new evaluation metric, the Distinctive Word Rate (DisWordRate).
- Score: 62.246950834745796
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneering works employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely the Group-based Differential Distinctive Captioning Method, which visually compares each image with the other images in its similar-image group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy...
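As a rough illustration of the differential weighting idea behind the GDMA module, the sketch below weights each object feature of the target image by its dissimilarity to objects in the other group images, so that unique regions are emphasized before decoding. This is a minimal sketch under assumed names and shapes (`gdma_weights`, `temperature`, plain region-feature tensors), not the authors' implementation.

```python
# Minimal sketch of group-based differential weighting; names and shapes
# are illustrative assumptions, not the paper's released code.
import torch
import torch.nn.functional as F

def gdma_weights(target_feats: torch.Tensor,
                 group_feats: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """Weight each object feature of the target image by how dissimilar
    it is to every object in the other images of the group.

    target_feats: (N, D) region features of the target image
    group_feats:  (M, D) region features pooled from the other group images
    returns:      (N,)   weights; higher = more distinctive region
    """
    t = F.normalize(target_feats, dim=-1)
    g = F.normalize(group_feats, dim=-1)
    sim = t @ g.t()                      # (N, M) cosine similarities
    max_sim = sim.max(dim=1).values      # closest match anywhere in the group
    # Regions whose best match in the group is still weak are unique to
    # this image, so they receive a high weight.
    return torch.sigmoid((1.0 - max_sim) / temperature)

def reweight_regions(target_feats: torch.Tensor,
                     group_feats: torch.Tensor) -> torch.Tensor:
    """Emphasize distinctive regions before attention/decoding."""
    w = gdma_weights(target_feats, group_feats)
    return target_feats * w.unsqueeze(-1)
```

A decoder attending over the reweighted features would then naturally favor objects that the other images in the group lack, which is the effect the abstract describes.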
Related papers
- Distinctive Image Captioning via CLIP Guided Group Optimization [13.102953452346297]
In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images.
We introduce a series of metrics that use the large-scale vision-language pre-training model CLIP to quantify distinctiveness (a minimal sketch follows this entry).
We propose a simple and effective training strategy that trains the model by comparing the target image with a similar image group and optimizing the group embedding gap.
arXiv Detail & Related papers (2022-08-08T16:37:01Z)
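As a rough illustration of the CLIP-guided idea in the entry above, the following sketch scores a caption by how much better it matches its target image than the other images in the similar group. This is a generic recipe, not the paper's exact metric; the margin formulation and the ViT-B/32 backbone are assumptions.

```python
# Hedged sketch of a CLIP-based group-distinctiveness score; the margin
# formulation below is an illustrative assumption, not the paper's metric.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def group_distinctiveness(caption: str, target_path: str,
                          group_paths: list[str]) -> float:
    images = [preprocess(Image.open(p)) for p in [target_path, *group_paths]]
    image_batch = torch.stack(images).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image_batch).float()
        txt_emb = model.encode_text(text).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.t()).squeeze(1)  # similarity to each image
    # Distinctive caption: high similarity to the target, low to the rest.
    return (sims[0] - sims[1:].mean()).item()
```

Averaging such a margin over a caption set gives a group-level distinctiveness score; the paper defines its own CLIP-based metrics, which may differ in detail.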
- Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, pulling similar images, topics, and captions close together in the shared semantic space (a generic sketch follows this entry).
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
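The pairwise ranking objective mentioned in the entry above is commonly implemented as a bidirectional margin (triplet-style) loss over a batch of matched image-caption pairs. The sketch below shows that generic formulation; it is not the paper's exact loss.

```python
# Generic bidirectional margin ranking loss over a batch, a common way to
# train a shared image-caption embedding space (illustrative, not the paper's).
import torch

def pairwise_ranking_loss(img_emb: torch.Tensor,
                          cap_emb: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """img_emb, cap_emb: (B, D) L2-normalized embeddings of matched pairs.
    Pushes each matched pair above all mismatched pairs by at least `margin`."""
    scores = img_emb @ cap_emb.t()            # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)          # matched-pair scores, (B, 1)
    # Hinge against mismatched captions (rows) and mismatched images (cols).
    cost_cap = (margin + scores - pos).clamp(min=0)
    cost_img = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return (cost_cap.masked_fill(mask, 0).mean()
            + cost_img.masked_fill(mask, 0).mean())
```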
- On Distinctive Image Captioning via Comparing and Reweighting [52.3731631461383]
In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images.
Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equally distinctive.
In contrast, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions.
arXiv Detail & Related papers (2022-04-08T08:59:23Z)
- Group-based Distinctive Image Captioning with Memory Attention [45.763534774116856]
The Group-based Distinctive Captioning Model (GdisCap) improves the distinctiveness of image captions.
A new evaluation metric, the distinctive word rate (DisWordRate), is proposed to measure the distinctiveness of captions (a hedged sketch follows this entry).
arXiv Detail & Related papers (2021-08-20T12:46:36Z)
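One plausible reading of a distinctive-word-rate style metric is the fraction of generated-caption words that appear in the target image's ground-truth captions but in none of the other group images' captions. The sketch below implements that reading; the papers' exact definition (tokenization, stopwords, normalization) may differ.

```python
# Hedged sketch of a "distinctive word rate" style metric; this is one
# plausible reading of the metric's name, not the published definition.
def dis_word_rate(caption: str, target_refs: list[str],
                  group_refs: list[str]) -> float:
    tokenize = lambda s: set(s.lower().split())
    target_vocab = set().union(*map(tokenize, target_refs))
    group_vocab = set().union(*map(tokenize, group_refs))
    words = caption.lower().split()
    if not words:
        return 0.0
    # A word is "distinctive" if the target's references use it
    # but no reference of any other group image does.
    distinctive = [w for w in words
                   if w in target_vocab and w not in group_vocab]
    return len(distinctive) / len(words)
```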
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and assigns more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets [52.3731631461383]
We aim to improve the distinctiveness of image captions through training with sets of similar images.
Our metric shows that the human annotations of each image are not equally distinctive.
arXiv Detail & Related papers (2020-07-14T07:40:39Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)