NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric
Preference Checklist
- URL: http://arxiv.org/abs/2305.08566v4
- Date: Fri, 26 May 2023 07:30:35 GMT
- Title: NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric
Preference Checklist
- Authors: Iftitahu Ni'mah and Meng Fang and Vlado Menkovski and Mykola
Pechenizkiy
- Abstract summary: Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective and highly adaptable to diverse NLG tasks.
Human-aligned metrics (CTC, CtrlEval, UniEval) improve the correlation level by incorporating desirable human-like qualities as a training objective.
We show that automatic metrics provide better guidance than humans in discriminating system-level performance in Text Summarization and Controlled Generation tasks.
- Score: 20.448405494617397
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this study, we analyze automatic evaluation metrics for Natural Language
Generation (NLG), specifically task-agnostic metrics and human-aligned metrics.
Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective
and highly adaptable to diverse NLG tasks, yet they correlate weakly with human
judgments. Human-aligned metrics (CTC, CtrlEval, UniEval) improve the correlation
level by incorporating desirable human-like qualities as a training objective.
However, their effectiveness at discerning system-level performance and the quality
of system outputs remains unclear.
We present the metric preference checklist as a framework for assessing the
effectiveness of automatic metrics in three NLG tasks: Text Summarization,
Dialogue Response Generation, and Controlled Generation. The proposed framework
provides a means (i) to verify whether automatic metrics are faithful to
human preference, regardless of their correlation level with humans; and (ii) to
inspect the strengths and limitations of NLG systems via pairwise
evaluation. We show that automatic metrics provide better guidance than humans
in discriminating system-level performance in Text Summarization and Controlled
Generation tasks. We also show that a multi-aspect human-aligned metric (UniEval)
is not necessarily dominant over single-aspect human-aligned metrics (CTC,
CtrlEval) and task-agnostic metrics (BLEU, BERTScore), particularly in
Controlled Generation tasks.
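The core idea behind a pairwise, preference-based check is to ask, for every pair of NLG systems, whether an automatic metric prefers the same system as human evaluators do, rather than only reporting a correlation coefficient. Below is a minimal sketch of that idea; it is not the authors' released code, and the system names, score values, and the pairwise_agreement helper are hypothetical placeholders.

```python
from itertools import combinations
from scipy.stats import kendalltau

# Hypothetical system-level scores (system -> mean score over a shared test set).
human_scores  = {"sysA": 3.8, "sysB": 3.1, "sysC": 4.2}    # e.g. mean Likert ratings
metric_scores = {"sysA": 0.41, "sysB": 0.45, "sysC": 0.52} # e.g. mean BERTScore F1

def pairwise_agreement(metric, human):
    """Fraction of system pairs for which the metric prefers the same system as humans."""
    agree, total = 0, 0
    for a, b in combinations(sorted(human), 2):
        human_pref = human[a] - human[b]
        if human_pref == 0:            # skip pairs that humans consider tied
            continue
        metric_pref = metric[a] - metric[b]
        agree += (human_pref > 0) == (metric_pref > 0)
        total += 1
    return agree / total if total else float("nan")

# Correlation level (what meta-evaluations usually report) ...
systems = sorted(human_scores)
tau, _ = kendalltau([human_scores[s] for s in systems],
                    [metric_scores[s] for s in systems])
# ... versus preference fidelity (what a checklist-style pairwise comparison probes).
print(f"Kendall tau: {tau:.2f}")
print(f"Pairwise system-level agreement: {pairwise_agreement(metric_scores, human_scores):.2f}")
```

Under this view, a metric can show only modest correlation with human ratings yet still rank systems in the same order as humans (or vice versa), which is the distinction a preference-based evaluation is meant to surface.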
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21T11:26:26Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation [18.049566239050762]
Proper evaluation metrics are like a beacon guiding the research of simile generation (SG).
To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion.
Our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.
arXiv Detail & Related papers (2023-06-13T06:06:01Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric, COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Perturbation CheckLists for Evaluating NLG Evaluation Metrics [16.20764980129339]
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria.
Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated.
This suggests that the current recipe of proposing new automatic evaluation metrics for NLG is inadequate.
arXiv Detail & Related papers (2021-09-13T08:26:26Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.