On the Effectiveness of Automated Metrics for Text Generation Systems
- URL: http://arxiv.org/abs/2210.13025v1
- Date: Mon, 24 Oct 2022 08:15:28 GMT
- Title: On the Effectiveness of Automated Metrics for Text Generation Systems
- Authors: Pius von Däniken, Jan Deriu, Don Tuggener, Mark Cieliebak
- Abstract summary: We propose a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets.
The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems.
- Score: 4.661309379738428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major challenge in the field of Text Generation is evaluation because we
lack a sound theory that can be leveraged to extract guidelines for evaluation
campaigns. In this work, we propose a first step towards such a theory that
incorporates different sources of uncertainty, such as imperfect automated
metrics and insufficiently sized test sets. The theory has practical
applications, such as determining the number of samples needed to reliably
distinguish the performance of a set of Text Generation systems in a given
setting. We showcase the application of the theory on the WMT 21 and
Spot-The-Bot evaluation data and outline how it can be leveraged to improve the
evaluation protocol regarding the reliability, robustness, and significance of
the evaluation outcome.
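The sample-size application mentioned above can be illustrated with a standard two-sample power calculation. This is a simplified sketch under textbook assumptions (i.i.d. per-sample metric scores, known common standard deviation, a two-sided z-test), not the paper's actual theory, which additionally models metric error; the function name and parameters are illustrative.

```python
# Illustrative sketch: how many test samples per system are needed to
# distinguish two Text Generation systems whose mean automated-metric scores
# differ by `delta`, assuming i.i.d. scores with standard deviation `sigma`.
import math
from statistics import NormalDist


def required_samples(delta: float, sigma: float,
                     alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-system sample size for a two-sample z-test on mean metric scores."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # critical value for a two-sided test
    z_beta = nd.inv_cdf(power)            # quantile matching the desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


# Example: detecting a 0.1-point gap in mean score with sigma = 0.2
# requires about 63 samples per system at alpha = 0.05 and 80% power.
print(required_samples(delta=0.1, sigma=0.2))
```

Note how the required test-set size grows quadratically as the performance gap shrinks, which is one reason insufficiently sized test sets make system rankings unreliable.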
Related papers
- Check-Eval: A Checklist-based Approach for Evaluating Text Quality [3.4069627091757178]
Check-Eval is an evaluation framework for large language models (LLMs)
Check-Eval can be employed as both a reference-free and reference-dependent evaluation method.
We validate Check-Eval on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and SummEval.
arXiv Detail & Related papers (2024-07-19T17:14:16Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
Existing robustness evaluation practice may suffer from incomplete evaluation coverage, impractical evaluation protocols, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text along both objective and subjective dimensions.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis of ten factuality metrics shows that the framework provides robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Perception Score, A Learned Metric for Open-ended Text Generation Evaluation [62.7690450616204]
We propose a novel and powerful learning-based evaluation metric: Perception Score.
The method scores the overall quality of the generation holistically, instead of focusing on a single criterion such as word overlap.
arXiv Detail & Related papers (2020-08-07T10:48:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.