Contrastive Semantic Similarity Learning for Image Captioning Evaluation
with Intrinsic Auto-encoder
- URL: http://arxiv.org/abs/2106.15312v1
- Date: Tue, 29 Jun 2021 12:27:05 GMT
- Title: Contrastive Semantic Similarity Learning for Image Captioning Evaluation
with Intrinsic Auto-encoder
- Authors: Chao Zeng, Tiesong Zhao, Sam Kwong
- Abstract summary: Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn sentence-level representations.
Experimental results show that our proposed method aligns well with the scores generated by other contemporary metrics.
- Score: 52.42057181754076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically evaluating the quality of image captions can be very
challenging, since human language is so flexible that the same meaning can be
expressed in many different ways. Most current captioning metrics rely on
token-level matching between the candidate caption and the ground-truth label
sentences, which usually neglects sentence-level information. Motivated by the
auto-encoder mechanism and contrastive representation learning advances, we
propose a learning-based metric for image captioning, which we call Intrinsic
Image Captioning Evaluation ($I^2CE$). We develop three progressive model
structures to learn sentence-level representations: a single-branch model, a
dual-branch model, and a triple-branch model. Our empirical tests show that
$I^2CE$ trained with the dual-branch structure achieves better consistency with
human judgments than contemporary image captioning evaluation metrics.
Furthermore, we select several state-of-the-art image captioning models and
test their performance on the MS COCO dataset with respect to both contemporary
metrics and the proposed $I^2CE$. Experimental results show that our proposed
method aligns well with the scores generated by other contemporary metrics.
Accordingly, the proposed metric could serve as a novel indicator of the
intrinsic information between captions, which may be complementary to the
existing ones.
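
As a rough illustration of the ideas in the abstract (sentence-level embeddings trained with a contrastive objective, then used as a caption-similarity score), the sketch below uses PyTorch and a deliberately simple encoder. It is not the authors' implementation: the names CaptionEncoder, info_nce_loss, and sentence_level_score, the mean-pooling encoder, and all dimensions are illustrative assumptions, and the paper's single/dual/triple-branch structures and auto-encoder reconstruction branch are not reproduced here.

```python
# Minimal sketch (not the authors' code) of a sentence-level, contrastively
# trained caption-similarity metric in the spirit of I2CE.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionEncoder(nn.Module):
    """Toy sentence encoder: token embedding + mean pooling + linear projection."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, out_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.proj = nn.Linear(embed_dim, out_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), with 0 used as the padding index
        mask = (token_ids != 0).float().unsqueeze(-1)           # (B, T, 1)
        summed = (self.embed(token_ids) * mask).sum(dim=1)      # (B, embed_dim)
        pooled = summed / mask.sum(dim=1).clamp(min=1.0)        # mean over real tokens
        return F.normalize(self.proj(pooled), dim=-1)           # unit-norm sentence embedding


def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Contrastive objective: each anchor should be closest to its own positive,
    with the other captions in the batch serving as in-batch negatives."""
    logits = anchor @ positive.t() / temperature                # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


def sentence_level_score(encoder: CaptionEncoder,
                         candidate_ids: torch.Tensor,
                         reference_ids: torch.Tensor) -> float:
    """Score a candidate caption against a set of references as the maximum
    cosine similarity in the learned sentence-embedding space."""
    with torch.no_grad():
        cand = encoder(candidate_ids)                           # (1, out_dim)
        refs = encoder(reference_ids)                           # (R, out_dim)
        return (cand @ refs.t()).max().item()
```

In such a setup, the positive for each caption could be another human reference for the same image, so that semantically equivalent sentences are pulled together while other captions in the batch act as negatives; at evaluation time, a candidate is scored by its embedding similarity to the references rather than by token overlap.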
Related papers
- A Novel Evaluation Framework for Image2Text Generation [15.10524860121122]
We propose an evaluation framework rooted in a modern large language model (LLM) capable of image generation.
A high similarity score suggests that the image captioning model has accurately generated textual descriptions.
A low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance.
arXiv Detail & Related papers (2024-08-03T09:27:57Z)
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC)
Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level.
We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
- Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning [0.65268245109828]
Coherent entity-aware multi-image captioning aims to generate coherent captions for neighboring images in a news document.
This paper proposes a coherent entity-aware multi-image captioning model by making use of coherence relationships.
arXiv Detail & Related papers (2023-02-04T07:50:31Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions [27.41088864449921]
Image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of generated captions.
We present the first learned generation metric for evaluating output captions.
We demonstrate a higher correlation coefficient between our proposed metric and human judgments on the results of a number of state-of-the-art caption models, when compared to several other metrics such as BLEURT and BERT.
arXiv Detail & Related papers (2021-09-11T13:43:36Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when it encounters semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
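
The last entry above (Improving Image Captioning with Better Use of Captions) mentions multi-task decoding that jointly predicts word and object/predicate tag sequences. The following is a minimal, generic sketch of that joint-prediction idea only, not the paper's architecture; JointDecoderStep, the GRU cell, and the loss weighting are illustrative assumptions.

```python
# Generic sketch of joint word/tag prediction at each caption decoding step.
# Not the architecture from the paper above; names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointDecoderStep(nn.Module):
    """One decoding step with two heads: next-word logits and object/predicate-tag logits."""

    def __init__(self, vocab_size: int, tag_size: int, hidden_dim: int = 512):
        super().__init__()
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        self.tag_head = nn.Linear(hidden_dim, tag_size)

    def forward(self, x: torch.Tensor, h: torch.Tensor):
        # x: (B, hidden_dim) step input features, h: (B, hidden_dim) previous decoder state
        h = self.cell(x, h)
        return self.word_head(h), self.tag_head(h), h


def joint_loss(word_logits, tag_logits, word_target, tag_target,
               tag_weight: float = 0.5) -> torch.Tensor:
    """Multi-task objective: word prediction plus a weighted auxiliary tag-prediction term."""
    return (F.cross_entropy(word_logits, word_target)
            + tag_weight * F.cross_entropy(tag_logits, tag_target))
```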