LCEval: Learned Composite Metric for Caption Evaluation
- URL: http://arxiv.org/abs/2012.13136v1
- Date: Thu, 24 Dec 2020 06:38:24 GMT
- Title: LCEval: Learned Composite Metric for Caption Evaluation
- Authors: Naeha Sharif and Lyndon White and Mohammed Bennamoun and Wei Liu and
Syed Afaq Ali Shah
- Abstract summary: We propose a neural network-based learned metric to improve caption-level caption evaluation.
This paper investigates the relationship between different linguistic features and the caption-level correlation of learned metrics.
Our proposed metric not only outperforms existing metrics in terms of caption-level correlation but also shows a strong system-level correlation with human assessments.
- Score: 37.2313913156926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation metrics hold a fundamental importance in the development
and fine-grained analysis of captioning systems. While current evaluation
metrics tend to achieve an acceptable correlation with human judgements at the
system level, they fail to do so at the caption level. In this work, we propose
a neural network-based learned metric to improve the caption-level caption
evaluation. To get a deeper insight into the parameters which impact a learned
metric's performance, this paper investigates the relationship between different
linguistic features and the caption-level correlation of the learned metrics.
We also compare metrics trained with different training examples to measure the
variations in their evaluation. Moreover, we perform a robustness analysis,
which highlights the sensitivity of learned and handcrafted metrics to various
sentence perturbations. Our empirical analysis shows that our proposed metric
not only outperforms the existing metrics in terms of caption-level correlation
but also shows a strong system-level correlation with human assessments.
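As a rough illustration of the idea described above, the sketch below combines scores from handcrafted metrics and a few simple linguistic features with a small neural network trained against human judgements, then measures caption-level agreement with Kendall's tau. The feature set, network size, loss, and training data are hypothetical placeholders, not LCEval's actual design.

```python
# A minimal sketch of a learned composite caption metric, assuming the general
# recipe in the abstract: handcrafted metric scores plus linguistic features
# are fed to a small network trained against human judgements.
# Feature names and hyperparameters are hypothetical, not LCEval's exact setup.
import torch
import torch.nn as nn
from scipy.stats import kendalltau

class CompositeCaptionMetric(nn.Module):
    """Small MLP mapping a per-caption feature vector to a score in [0, 1]."""
    def __init__(self, num_features: int = 5, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # score read as "probability the caption is good"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Hypothetical per-caption features:
# [BLEU-4, METEOR, CIDEr/10, length ratio, noun-overlap ratio]
features = torch.tensor([
    [0.31, 0.27, 0.045, 0.95, 0.60],
    [0.05, 0.12, 0.010, 0.40, 0.10],
    [0.52, 0.35, 0.080, 1.02, 0.75],
])
human = torch.tensor([1.0, 0.0, 1.0])  # binary human judgement per caption

model = CompositeCaptionMetric(num_features=features.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCELoss()

for _ in range(200):  # tiny illustrative training loop
    opt.zero_grad()
    loss = loss_fn(model(features), human)
    loss.backward()
    opt.step()

# Caption-level correlation: rank agreement between metric and human scores.
scores = model(features).detach().numpy()
tau, _ = kendalltau(scores, human.numpy())
print(f"caption-level Kendall tau: {tau:.3f}")
```

In the same spirit, a robustness check along the lines of the abstract's perturbation analysis would re-score altered versions of each caption (e.g. with words swapped or dropped) and compare how sharply the learned and handcrafted scores react.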
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics [22.041561519672456]
Large language models (LLMs) often produce unsupported or unverifiable content, known as "hallucinations".
We propose a comparative evaluation framework that assesses how effectively metrics distinguish citations across three levels of support.
Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support.
arXiv Detail & Related papers (2024-06-21T15:57:24Z)
- Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation [0.9576327614980397]
This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations.
As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them.
The results suggest GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts.
arXiv Detail & Related papers (2024-06-14T14:47:19Z)
- JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models [1.534667887016089]
We propose an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs.
We conducted experiments employing 10 image captioning models trained on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which contains 103,170 human evaluations.
arXiv Detail & Related papers (2023-11-07T18:33:34Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora [5.254054636427663]
The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications.
We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics.
arXiv Detail & Related papers (2022-11-29T14:47:07Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics can be roughly categorized into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)