On the Effectiveness of Automated Metrics for Text Generation Systems
- URL: http://arxiv.org/abs/2210.13025v1
- Date: Mon, 24 Oct 2022 08:15:28 GMT
- Title: On the Effectiveness of Automated Metrics for Text Generation Systems
- Authors: Pius von Däniken, Jan Deriu, Don Tuggener, Mark Cieliebak
- Abstract summary: We propose a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets.
The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems.
- Score: 4.661309379738428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major challenge in the field of Text Generation is evaluation because we
lack a sound theory that can be leveraged to extract guidelines for evaluation
campaigns. In this work, we propose a first step towards such a theory that
incorporates different sources of uncertainty, such as imperfect automated
metrics and insufficiently sized test sets. The theory has practical
applications, such as determining the number of samples needed to reliably
distinguish the performance of a set of Text Generation systems in a given
setting. We showcase the application of the theory on the WMT 21 and
Spot-The-Bot evaluation data and outline how it can be leveraged to improve the
evaluation protocol regarding the reliability, robustness, and significance of
the evaluation outcome.
Related papers
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - Measuring What Matters: Intrinsic Distance Preservation as a Robust Metric for Embedding Quality [0.0]
This paper introduces the Intrinsic Distance Preservation Evaluation (IDPE) method for assessing embedding quality.
IDPE is based on the preservation of Mahalanobis distances between data points in the original and embedded spaces.
Our results show that IDPE offers a more comprehensive and reliable assessment of embedding quality across various scenarios.
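The summary names the ingredient (pairwise Mahalanobis distances in the original and embedded spaces) but not the exact aggregation, so the following is only a minimal sketch of how such a preservation score could be computed; the rank-correlation aggregation, the covariance regularization, and the use of numpy/scipy are assumptions rather than the paper's definition of IDPE.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def mahalanobis_preservation(X, Z):
    """Compare pairwise Mahalanobis distances in the original space X with
    those in the embedded space Z via rank correlation. An illustrative
    stand-in for IDPE, not the metric defined in the paper."""
    def pairwise_mahalanobis(A):
        # Inverse covariance of the space, lightly regularized for stability.
        cov = np.cov(A, rowvar=False) + 1e-6 * np.eye(A.shape[1])
        return pdist(A, metric="mahalanobis", VI=np.linalg.inv(cov))

    d_orig = pairwise_mahalanobis(X)
    d_emb = pairwise_mahalanobis(Z)
    corr, _ = spearmanr(d_orig, d_emb)
    return corr  # values near 1.0 indicate well-preserved distance structure

# Toy example: 100 points in 10 dimensions, embedded into 2 dimensions
# by a random linear projection.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Z = X @ rng.normal(size=(10, 2))
print(mahalanobis_preservation(X, Z))
```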
arXiv Detail & Related papers (2024-07-31T13:26:09Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Correction of Errors in Preference Ratings from Automated Metrics for
Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
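The abstract states the goal (combining a small number of human ratings with many error-prone automated ratings) without spelling out the model, so the snippet below is only a generic illustration of that idea using inverse-variance weighting; the paper's actual statistical model of metric error is more involved, and all numbers here are hypothetical.

```python
def pooled_estimate(human_mean, human_var, metric_mean, metric_var):
    """Inverse-variance (precision-weighted) pooling of a human-rating
    estimate and an automated-metric estimate of system quality.
    A generic textbook combination, not the paper's model."""
    w_h = 1.0 / human_var    # precision of the human estimate
    w_m = 1.0 / metric_var   # precision of the automated estimate
    mean = (w_h * human_mean + w_m * metric_mean) / (w_h + w_m)
    var = 1.0 / (w_h + w_m)  # pooled estimate is less uncertain than either
    return mean, var

# Hypothetical numbers: few human ratings (high variance) combined with a
# cheap automated metric whose error was estimated on held-out data.
print(pooled_estimate(human_mean=0.72, human_var=0.04,
                      metric_mean=0.68, metric_var=0.01))  # (0.688, 0.008)
```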
arXiv Detail & Related papers (2023-06-06T17:09:29Z) - From Adversarial Arms Race to Model-centric Evaluation: Motivating a
Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantics-preserving but misleading perturbations to the inputs.
The existing practice of robustness evaluation may suffer from incomplete evaluation, impractical evaluation protocols, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that provides a comprehensive assessment by comparing generated and reference texts from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI- and question generation-and-answering-based approaches achieve strong and complementary results.
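As a concrete illustration of the NLI-based family of metrics studied here, the sketch below scores a generated sentence by the entailment probability an off-the-shelf NLI model assigns to it given the source text; the choice of the `roberta-large-mnli` checkpoint and the Hugging Face `transformers` API are assumptions for the example, not one of the metrics evaluated in the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder MNLI-style checkpoint; any NLI model exposing an
# "entailment" label would serve the same illustrative purpose.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def consistency_score(source: str, generated: str) -> float:
    """Entailment probability of the generated text given the source,
    used here as a simple stand-in for a factual-consistency score."""
    inputs = tokenizer(source, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    entail_idx = next(i for i, label in model.config.id2label.items()
                      if label.lower() == "entailment")
    return probs[entail_idx].item()

print(consistency_score(
    "The meeting was moved from Monday to Wednesday.",
    "The meeting now takes place on Wednesday.",
))
```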
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)