Perturbation CheckLists for Evaluating NLG Evaluation Metrics
- URL: http://arxiv.org/abs/2109.05771v1
- Date: Mon, 13 Sep 2021 08:26:26 GMT
- Title: Perturbation CheckLists for Evaluating NLG Evaluation Metrics
- Authors: Ananya B. Sai, Tanay Dixit, Dev Yashpal Sheth, Sreyas Mohan, Mitesh M. Khapra
- Abstract summary: Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria.
Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated.
This suggests that the current recipe of proposing new automatic evaluation metrics for NLG is inadequate.
- Score: 16.20764980129339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Generation (NLG) evaluation is a multifaceted task requiring
assessment of multiple desirable criteria, e.g., fluency, coherency, coverage,
relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG
tasks, we observe that the human evaluation scores on these multiple criteria
are often not correlated. For example, there is a very low correlation between
human scores on fluency and data coverage for the task of structured data to
text generation. This suggests that the current recipe of proposing new
automatic evaluation metrics for NLG by showing that they correlate well with
scores assigned by humans for a single criterion (overall quality) alone is
inadequate. Indeed, our extensive study involving 25 automatic evaluation
metrics across 6 different tasks and 18 different evaluation criteria shows
that there is no single metric which correlates well with human scores on all
desirable criteria, for most NLG tasks. Given this situation, we propose
CheckLists for better design and evaluation of automatic metrics. We design
templates which target a specific criterion (e.g., coverage) and perturb the
output such that the quality is affected only along this specific criterion
(e.g., the coverage drops). We show that existing evaluation metrics are not
robust against even such simple perturbations and disagree with scores assigned
by humans to the perturbed output. The proposed templates thus allow for a
fine-grained assessment of automatic evaluation metrics, exposing their
limitations, and will facilitate better design, analysis, and evaluation of such
metrics.
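As a concrete illustration of the perturbation-template idea, the sketch below applies a coverage-only perturbation (dropping trailing sentences) and checks whether a metric's score drops accordingly. Everything here is an assumption for illustration: the data, the `perturb_coverage` and `detects_coverage_drop` helpers, and the toy unigram-recall stand-in metric are not the paper's actual templates or studied metrics.

```python
# Sketch of a coverage-targeted perturbation check. `score_metric` stands in
# for any automatic metric (e.g., a wrapper around BLEU or BERTScore).
from typing import Callable, List


def perturb_coverage(sentences: List[str], n_drop: int = 1) -> str:
    """Degrade coverage only: drop the last n_drop sentences while the
    remaining text stays fluent and faithful to the input."""
    kept = sentences[: max(1, len(sentences) - n_drop)]
    return " ".join(kept)


def detects_coverage_drop(score_metric: Callable[[str, str], float],
                          reference: str,
                          sentences: List[str]) -> bool:
    """A coverage-sensitive metric should score the perturbed output lower."""
    original = " ".join(sentences)
    perturbed = perturb_coverage(sentences)
    return score_metric(perturbed, reference) < score_metric(original, reference)


if __name__ == "__main__":
    # Toy data-to-text example with a unigram-recall stand-in metric.
    reference = ("Alan Bean was born in Wheeler, Texas, was selected by NASA "
                 "in 1963, and spent 69 days in space.")
    output = ["Alan Bean was born in Wheeler, Texas.",
              "He was selected by NASA in 1963.",
              "He spent 69 days in space."]

    def unigram_recall(hyp: str, ref: str) -> float:
        def tokens(s: str) -> List[str]:
            return s.lower().replace(",", "").replace(".", "").split()
        hyp_tokens = set(tokens(hyp))
        ref_tokens = tokens(ref)
        return sum(t in hyp_tokens for t in ref_tokens) / len(ref_tokens)

    # Expect True: dropping the last sentence lowers the coverage-sensitive score.
    print(detects_coverage_drop(unigram_recall, reference, output))
```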
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - DecompEval: Evaluating Generated Texts as Unsupervised Decomposed
Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of the generated text into subquestions that measure the quality of each sentence.
The subquestions, together with the answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained
Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for the generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Automated Metrics for Medical Multi-Document Summarization Disagree with
Human Evaluations [22.563596069176047]
We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries.
We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, but in many cases the system rankings produced by these metrics are also anti-correlated with the rankings of human annotators (a minimal rank-correlation sketch follows after this list).
arXiv Detail & Related papers (2023-05-23T05:00:59Z) - NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric
Preference Checklist [20.448405494617397]
Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective and highly adaptable to diverse NLG tasks.
Human-aligned metrics (CTC, CtrlEval, UniEval) improve correlation by incorporating desirable human-like qualities as training objectives.
We show that automatic metrics provide better guidance than humans in discriminating system-level performance in Text Summarization and Controlled Generation tasks.
arXiv Detail & Related papers (2023-05-15T11:51:55Z) - Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z) - Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation
of Story Generation [9.299255585127158]
There is no consensus on which human evaluation criteria to use, and no analysis of how well automatic metrics correlate with them.
HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria.
arXiv Detail & Related papers (2022-08-24T16:35:32Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z) - Perception Score, A Learned Metric for Open-ended Text Generation
Evaluation [62.7690450616204]
We propose a novel and powerful learning-based evaluation metric: Perception Score.
The method measures the overall quality of the generation and scores it holistically, instead of focusing on only one evaluation criterion, such as word overlap.
arXiv Detail & Related papers (2020-08-07T10:48:40Z)
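Several of the papers above report system-level rank correlations (or anti-correlations) between automatic metrics and human judgments. The sketch below shows one common way such a check is run, using Kendall's tau over per-system mean scores; the system names and scores are fabricated for illustration only.

```python
# System-level ranking comparison between an automatic metric and human
# judgments using Kendall's tau. All scores below are made-up examples.
from scipy.stats import kendalltau

# Mean score per system, e.g. averaged over all generated outputs.
human_scores = {"sys_a": 4.1, "sys_b": 3.6, "sys_c": 2.9, "sys_d": 2.2}
metric_scores = {"sys_a": 0.31, "sys_b": 0.35, "sys_c": 0.40, "sys_d": 0.44}

systems = sorted(human_scores)
tau, p_value = kendalltau([human_scores[s] for s in systems],
                          [metric_scores[s] for s in systems])
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
# Here tau = -1.00: the metric ranks the systems in exactly the opposite
# order of the human annotators, i.e. the two rankings are anti-correlated.
```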