Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets
- URL: http://arxiv.org/abs/2007.06877v1
- Date: Tue, 14 Jul 2020 07:40:39 GMT
- Title: Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets
- Authors: Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
- Abstract summary: We aim to improve the distinctiveness of image captions through training with sets of similar images.
Our metric shows that the human annotations of each image are not equivalent based on distinctiveness.
- Score: 52.3731631461383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A wide range of image captioning models has been developed, achieving
significant improvement based on popular metrics, such as BLEU, CIDEr, and
SPICE. However, although the generated captions can accurately describe the
image, they are generic for similar images and lack distinctiveness, i.e.,
cannot properly describe the uniqueness of each image. In this paper, we aim to
improve the distinctiveness of image captions through training with sets of
similar images. First, we propose a distinctiveness metric -- between-set CIDEr
(CIDErBtw) to evaluate the distinctiveness of a caption with respect to those
of similar images. Our metric shows that the human annotations of each image
are not equivalent based on distinctiveness. Thus we propose several new
training strategies to encourage the distinctiveness of the generated caption
for each image, which are based on using CIDErBtw in a weighted loss function
or as a reinforcement learning reward. Finally, extensive experiments are
conducted, showing that our proposed approach significantly improves both
distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy
(e.g., as measured by CIDEr) for a wide variety of image captioning baselines.
These results are further confirmed through a user study.
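To make the training idea concrete, here is a minimal sketch, under stated assumptions, of how a between-set CIDEr score could be computed and used to reweight the loss over an image's ground-truth captions. It is not the authors' implementation: `cider_score` is a crude unigram-overlap stand-in for a real CIDEr scorer, and the 1 / (1 + CIDErBtw) weighting is an illustrative choice rather than the paper's exact formula.

```python
# Minimal sketch of the CIDErBtw idea from the abstract -- not the authors'
# implementation. A real setup would use an actual CIDEr scorer (e.g. from
# pycocoevalcap); the stand-in below only keeps the sketch self-contained.
from typing import List


def cider_score(caption: str, references: List[str]) -> float:
    """Crude unigram-overlap proxy standing in for a real CIDEr score."""
    cap = set(caption.lower().split())
    if not cap or not references:
        return 0.0
    overlaps = [len(cap & set(r.lower().split())) / len(cap) for r in references]
    return sum(overlaps) / len(overlaps)


def cider_btw(caption: str, similar_refs: List[List[str]]) -> float:
    """Between-set CIDEr: average similarity of `caption` to the reference
    captions of each *similar* image. Higher means less distinctive."""
    scores = [cider_score(caption, refs) for refs in similar_refs]
    return sum(scores) / max(len(scores), 1)


def reweighted_loss(losses: List[float], human_captions: List[str],
                    similar_refs: List[List[str]]) -> float:
    """Down-weight the loss of ground-truth captions that overlap heavily with
    similar images' captions; the 1 / (1 + CIDErBtw) form is illustrative."""
    weights = [1.0 / (1.0 + cider_btw(c, similar_refs)) for c in human_captions]
    total = sum(weights)
    return sum((w / total) * l for w, l in zip(weights, losses))
```

The same CIDErBtw quantity could equally be subtracted from a CIDEr reward in a self-critical reinforcement-learning setup, which is the second use the abstract describes.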
Related papers
- Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
arXiv Detail & Related papers (2024-08-29T18:00:03Z)
- Improving Generalization of Image Captioning with Unsupervised Prompt Learning [63.26197177542422]
Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data.
GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
arXiv Detail & Related papers (2023-08-05T12:27:01Z)
- Distinctive Image Captioning via CLIP Guided Group Optimization [13.102953452346297]
In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images.
We introduce a series of metrics that use the large-scale vision-language pre-training model CLIP to quantify distinctiveness.
We propose a simple and effective training strategy that trains the model by comparing the target image with a similar image group and optimizing the group embedding gap.
arXiv Detail & Related papers (2022-08-08T16:37:01Z)
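The entry above describes using CLIP to quantify distinctiveness via a group embedding gap. A rough sketch of what such a signal could look like with the off-the-shelf Hugging Face CLIP model follows; it is an illustration under assumptions, not that paper's actual metric or training code.

```python
# Rough sketch of a CLIP-based "group embedding gap": how much better a caption
# matches its target image than the other images in a similar-image group.
# Illustration only -- not the cited paper's metric or implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def group_embedding_gap(caption: str, target: Image.Image,
                        group: "list[Image.Image]") -> float:
    """CLIP similarity of the caption to the target image minus its average
    similarity to the other group images; a larger gap is more distinctive.
    Assumes `group` is non-empty."""
    inputs = processor(text=[caption], images=[target] + list(group),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    sims = out.logits_per_text[0]  # (num_images,) scaled cosine similarities
    return (sims[0] - sims[1:].mean()).item()
```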
- On Distinctive Image Captioning via Comparing and Reweighting [52.3731631461383]
In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images.
Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent based on distinctiveness.
In contrast, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions.
arXiv Detail & Related papers (2022-04-08T08:59:23Z)
- Group-based Distinctive Image Captioning with Memory Attention [45.763534774116856]
The Group-based Distinctive Captioning Model (GdisCap) improves the distinctiveness of image captions.
A new evaluation metric, distinctive word rate (DisWordRate), is proposed to measure the distinctiveness of captions.
arXiv Detail & Related papers (2021-08-20T12:46:36Z)
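DisWordRate is only named in the entry above, not defined; the sketch below assumes one plausible reading, the fraction of caption words that never appear in the captions of the image's similar images, and should not be taken as the metric's actual definition.

```python
# Assumed reading of "distinctive word rate" for illustration only: the share
# of caption words that do not occur in any caption of the similar images.
from typing import List


def dis_word_rate(caption: str, similar_captions: List[str]) -> float:
    words = caption.lower().split()
    if not words:
        return 0.0
    seen = {w for c in similar_captions for w in c.lower().split()}
    return sum(1 for w in words if w not in seen) / len(words)
```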
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experimental results show that our proposed method aligns well with the scores generated by other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Towards Unique and Informative Captioning of Images [40.036350846970706]
We analyze both modern captioning systems and evaluation metrics.
We design a new metric, SPICE-U, by introducing a notion of uniqueness over the concepts generated in a caption.
We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness.
arXiv Detail & Related papers (2020-09-08T19:01:33Z)
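The uniqueness notion behind SPICE-U is not spelled out in this summary; as a loose illustration, concept uniqueness could be scored by how rarely each concept occurs across a corpus of reference captions, as in the assumed sketch below.

```python
# Loose illustration of "uniqueness over the concepts generated in a caption":
# rarer concepts score higher. This is an assumption, not SPICE-U's definition.
from collections import Counter
from typing import List


def concept_uniqueness(concepts: List[str], corpus_concepts: List[str],
                       num_captions: int) -> float:
    """`concepts`: concepts extracted from one caption (objects, attributes,
    relations); `corpus_concepts`: concepts pooled over all reference captions."""
    if not concepts:
        return 0.0
    freq = Counter(corpus_concepts)
    rarities = [1.0 - min(freq[c] / num_captions, 1.0) for c in concepts]
    return sum(rarities) / len(rarities)
```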