BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation
- URL: http://arxiv.org/abs/2110.09147v1
- Date: Mon, 18 Oct 2021 10:03:19 GMT
- Title: BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation
- Authors: Thomas Scialom and Felix Hill
- Abstract summary: Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
- Score: 16.81712151903078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing (NLP) systems are increasingly trained to
generate open-ended text rather than classifying between responses. This makes
research on evaluation metrics for generated language -- functions that score
system output given the context and/or human reference responses -- of critical
importance. However, different metrics have different strengths and biases, and
reflect human intuitions better on some tasks than others. There is currently
no simple, unified way to compare, analyse or evaluate metrics across a
representative set of tasks. Here, we describe the Benchmark to Evaluate
Automatic Metrics (BEAMetrics), a resource to make research into new metrics
itself easier to evaluate. BEAMetrics users can quickly compare existing and
new metrics with human judgements across a diverse set of tasks, quality
dimensions (fluency vs. coherence vs. informativeness, etc.), and languages. As
generation experts might predict, BEAMetrics reveals stark task-dependent
differences between existing metrics, and consistently poor performance on
tasks with complex answer spaces or high reliance on general knowledge. While
this analysis highlights a critical issue facing current research practice,
BEAMetrics also contributes to its resolution by facilitating research into
better metrics -- particularly those that can account for the complex
interaction between context and general knowledge inherent to many modern NLP
applications. BEAMetrics is available under the MIT License:
https://github.com/ThomasScialom/BEAMetrics
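To make concrete what it means to evaluate a metric against human judgements, here is a minimal sketch of the kind of meta-evaluation described above: score system outputs with an automatic metric, then correlate those scores with human ratings per task and per quality dimension. The record layout and function names below are illustrative assumptions, not BEAMetrics' actual interface; see the repository above for the real one.

```python
# Minimal meta-evaluation sketch: correlate an automatic metric with human
# judgements, grouped by task and quality dimension. Field names and the
# metric_fn signature are assumptions for illustration only.
from collections import defaultdict
from scipy.stats import pearsonr, spearmanr

def meta_evaluate(records, metric_fn):
    """records: iterable of dicts with keys 'task', 'dimension', 'context',
    'references', 'hypothesis', and a numeric 'human_score'.
    metric_fn(context, references, hypothesis) -> float."""
    grouped = defaultdict(lambda: ([], []))
    for r in records:
        auto = metric_fn(r["context"], r["references"], r["hypothesis"])
        key = (r["task"], r["dimension"])
        grouped[key][0].append(auto)
        grouped[key][1].append(r["human_score"])

    results = {}
    for key, (auto_scores, human_scores) in grouped.items():
        results[key] = {
            "pearson": pearsonr(auto_scores, human_scores)[0],
            "spearman": spearmanr(auto_scores, human_scores)[0],
            "n": len(auto_scores),
        }
    return results
```

A metric that tracks human judgement on, say, summarization fluency but not on knowledge-heavy question answering shows up here as a large gap between per-task correlations, which is exactly the task-dependent behaviour the abstract reports.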
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21T11:26:26Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose this instruction-style question about the quality of the generated text into subquestions that measure the quality of each sentence.
The subquestions, together with answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result (a rough illustrative sketch of this decomposed-QA idea appears after this list).
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose ICE-Score, a new evaluation metric that instructs large language models to assess code.
Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences.
Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
arXiv Detail & Related papers (2023-04-27T16:38:17Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation [27.129551973093008]
InfoLM is a family of untrained metrics that can be viewed as string-based metrics.
This family of metrics also makes use of information measures, allowing InfoLM to be adapted to various evaluation criteria.
arXiv Detail & Related papers (2021-12-02T20:09:29Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)
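As noted in the DecompEval entry above, one line of work frames NLG evaluation as decomposed, instruction-style question answering. The sketch below illustrates that general idea only; the model choice, prompt wording, and yes/no aggregation rule are assumptions made for illustration and are not the authors' implementation.

```python
# Rough sketch of a decomposed-QA style evaluator: ask one yes/no subquestion
# per generated sentence, then recompose the answers into a single score.
# Model, prompts, and aggregation are illustrative assumptions.
from transformers import pipeline

qa = pipeline("text2text-generation", model="google/flan-t5-base")

def decomposed_quality_score(context: str, generated_text: str) -> float:
    sentences = [s.strip() for s in generated_text.split(".") if s.strip()]
    answers = []
    for sentence in sentences:
        prompt = (
            f"Context: {context}\n"
            f"Sentence: {sentence}\n"
            "Question: Is this sentence coherent and consistent with the context? "
            "Answer yes or no."
        )
        reply = qa(prompt, max_new_tokens=4)[0]["generated_text"].lower()
        answers.append(1.0 if "yes" in reply else 0.0)
    return sum(answers) / len(answers) if answers else 0.0
```

The fraction-of-yes aggregation is the simplest possible recomposition; richer schemes could weight sentences or combine answers to several subquestions per sentence.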