JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures
for Image Captioning Models
- URL: http://arxiv.org/abs/2311.04192v1
- Date: Tue, 7 Nov 2023 18:33:34 GMT
- Title: JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures
for Image Captioning Models
- Authors: Yuiga Wada, Kanta Kaneda, Komei Sugiura
- Abstract summary: We propose an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs.
We conducted experiments employing 10 image captioning models trained on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which contains 103,170 human evaluations.
- Score: 1.534667887016089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning studies heavily rely on automatic evaluation metrics such as
BLEU and METEOR. However, such n-gram-based metrics have been shown to
correlate poorly with human evaluation, leading to the proposal of alternative
metrics such as SPICE for English; however, no equivalent metrics have been
established for other languages. Therefore, in this study, we propose an
automatic evaluation metric called JaSPICE, which evaluates Japanese captions
based on scene graphs. The proposed method generates a scene graph from
dependencies and the predicate-argument structure, and extends the graph using
synonyms. We conducted experiments employing 10 image captioning models trained
on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which
contains 103,170 human evaluations. The results showed that our metric
outperformed the baseline metrics for the correlation coefficient with the
human evaluation.
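The scoring idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that a scene graph has already been flattened into a set of tuples (object, object-attribute, and subject-relation-object), that synonyms are handled by mapping each word to a canonical form, and that the score is an F1 over matched tuples, as in SPICE. The names `normalize`, `tuple_f1`, and `canonical` are hypothetical, and the example tuples are invented for illustration.

```python
from typing import Dict, Set, Tuple

SceneTuple = Tuple[str, ...]


def normalize(t: SceneTuple, canonical: Dict[str, str]) -> SceneTuple:
    """Map every word in a tuple to its canonical synonym (assumed lookup table)."""
    return tuple(canonical.get(word, word) for word in t)


def tuple_f1(
    candidate: Set[SceneTuple],
    reference: Set[SceneTuple],
    canonical: Dict[str, str],
) -> float:
    """F1 over scene-graph tuples after synonym normalization (SPICE-style sketch)."""
    cand = {normalize(t, canonical) for t in candidate}
    ref = {normalize(t, canonical) for t in reference}
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tuples; a real system would derive them from a parser's
# dependency and predicate-argument output.
candidate = {("man",), ("couch",), ("man", "sit_on", "couch")}
reference = {("man",), ("sofa",), ("man", "sit_on", "sofa"), ("sofa", "red")}
canonical = {"couch": "sofa"}  # assumed synonym table

score = tuple_f1(candidate, reference, canonical)
print(f"{score:.3f}")
```

Here all three candidate tuples match after normalization (precision 1.0) and three of the four reference tuples are covered (recall 0.75). Comparing such scores with human ratings via a correlation coefficient is how the abstract describes the metric being validated.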
Related papers
- HICEScore: A Hierarchical Metric for Image Captioning Evaluation [10.88292081473071]
We propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S).
By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism.
Our proposed metric achieves the SOTA performance on several benchmarks, outperforming existing reference-free metrics.
arXiv Detail & Related papers (2024-07-26T08:24:30Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion [78.76867266561537]
The evaluation process still heavily relies on closed-set metrics without considering the similarity between predicted and ground truth categories.
To tackle this issue, we first survey eleven similarity measurements between two categorical words.
We designed novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks.
arXiv Detail & Related papers (2023-11-06T18:59:01Z)
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics can be roughly divided into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z)
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions [27.41088864449921]
Image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of output text.
We present the first learned generation metric for evaluating output captions.
We demonstrate a higher correlation with human judgments for our proposed metric, on the results of a number of state-of-the-art caption models, when compared to several other metrics such as BLEURT and BERT.
arXiv Detail & Related papers (2021-09-11T13:43:36Z)
- LCEval: Learned Composite Metric for Caption Evaluation [37.2313913156926]
We propose a neural network-based learned metric to improve the caption-level caption evaluation.
This paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics.
Our proposed metric not only outperforms the existing metrics in terms of caption-level correlation but it also shows a strong system-level correlation against human assessments.
arXiv Detail & Related papers (2020-12-24T06:38:24Z)