Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation
Metrics using Measurement Theory
- URL: http://arxiv.org/abs/2305.14889v2
- Date: Mon, 23 Oct 2023 01:02:48 GMT
- Title: Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation
Metrics using Measurement Theory
- Authors: Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao
- Abstract summary: MetricEval is a framework for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics.
We aim to promote the design, evaluation, and interpretation of valid and reliable metrics to advance robust and effective NLG models.
- Score: 46.06645793520894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address a fundamental challenge in Natural Language Generation (NLG) model
evaluation -- the design and evaluation of evaluation metrics. Recognizing the
limitations of existing automatic metrics and the noise arising from how current
human evaluation is conducted, we propose MetricEval, a framework informed by
measurement theory, the foundation of educational test design, for
conceptualizing and evaluating the reliability and validity of NLG evaluation
metrics. The framework formalizes the source of measurement error and offers
statistical tools for evaluating evaluation metrics based on empirical data.
With our framework, one can quantify the uncertainty of the metrics to better
interpret the result. To exemplify the use of our framework in practice, we
analyzed a set of evaluation metrics for summarization and identified issues
related to conflated validity structure in human-eval and reliability in
LLM-based metrics. Through MetricEval, we aim to promote the design,
evaluation, and interpretation of valid and reliable metrics to advance robust
and effective NLG models.
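To make the measurement-theoretic framing above concrete, the sketch below shows two simple checks in that spirit: a split-half reliability estimate for an automatic metric's system-level scores, and a bootstrap confidence interval for the metric's correlation with human ratings. This is a minimal illustration, not the MetricEval toolkit itself; the array shapes and function names are assumptions made for the example.
```python
# Illustrative sketch only (not the MetricEval toolkit): two simple
# measurement-theoretic checks for an automatic NLG evaluation metric.
#   1) split-half reliability of system-level scores (Spearman-Brown corrected)
#   2) a bootstrap confidence interval for the metric's correlation with
#      human ratings, quantifying uncertainty in a validity estimate.
# All shapes and names below are assumptions for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def split_half_reliability(scores: np.ndarray) -> float:
    """scores: shape (n_systems, n_examples) of per-example metric scores.

    Randomly splits the examples in half, correlates the two half-averaged
    system scores, and applies the Spearman-Brown correction.
    """
    n_examples = scores.shape[1]
    perm = rng.permutation(n_examples)
    half_a = scores[:, perm[: n_examples // 2]].mean(axis=1)
    half_b = scores[:, perm[n_examples // 2:]].mean(axis=1)
    r, _ = stats.pearsonr(half_a, half_b)
    return 2 * r / (1 + r)  # Spearman-Brown correction

def metric_human_corr_ci(metric_scores: np.ndarray, human_scores: np.ndarray,
                         n_boot: int = 2000, alpha: float = 0.05):
    """Point estimate and bootstrap CI for the Spearman correlation between
    system-level metric scores and human ratings."""
    n = len(metric_scores)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample systems with replacement
        rho, _ = stats.spearmanr(metric_scores[idx], human_scores[idx])
        boot.append(rho)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    point, _ = stats.spearmanr(metric_scores, human_scores)
    return point, (lo, hi)
```
A low split-half reliability or a wide bootstrap interval would signal that an observed metric-human correlation should be interpreted with caution.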
Related papers
- Towards Evaluation for Real-World LLM Unlearning [16.31710864838019]
We propose a new metric called Distribution Correction-based Unlearning Evaluation (DCUE).
It identifies core tokens and corrects distributional biases in their confidence scores using a validation set.
Results are quantified using the Kolmogorov-Smirnov test (a minimal sketch of such a distributional comparison appears after this list).
arXiv Detail & Related papers (2025-08-02T11:32:41Z)
- Reranking-based Generation for Unbiased Perspective Summarization [10.71668103641552]
We develop a test set for benchmarking metric reliability using human annotations.
We show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators.
Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
arXiv Detail & Related papers (2025-06-19T00:01:43Z)
- Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy [52.261323452286554]
We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics.
Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts.
arXiv Detail & Related papers (2025-03-25T16:42:25Z)
- Evaluating Step-by-step Reasoning Traces: A Survey [3.895864050325129]
We propose a taxonomy of evaluation criteria with four top-level categories (groundedness, validity, coherence, and utility).
We then categorize metrics based on their implementations, survey which metrics are used for assessing each criterion, and explore whether evaluator models can transfer across different criteria.
arXiv Detail & Related papers (2025-02-17T19:58:31Z)
- A Critical Look at Meta-evaluating Summarisation Evaluation Metrics [11.541368732416506]
We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics.
We call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal.
arXiv Detail & Related papers (2024-09-29T01:30:13Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization? [28.30641958347868]
We introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions.
The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets.
Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods.
arXiv Detail & Related papers (2024-06-25T04:41:17Z)
- Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z)
- From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications [26.857056013032263]
Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications.
Our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications.
arXiv Detail & Related papers (2024-04-10T15:46:08Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development.
Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
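As referenced in the DCUE entry above, the sketch below illustrates only the final quantification step that the summary describes: comparing confidence-score distributions with a two-sample Kolmogorov-Smirnov test. It is not the paper's implementation; the data and names are hypothetical, and only SciPy's standard `ks_2samp` call is used.
```python
# Illustrative sketch only (not the DCUE implementation): quantify how much
# a model's per-token confidence distribution on targeted ("core") tokens
# shifts after unlearning, using a two-sample Kolmogorov-Smirnov test.
# `conf_reference` / `conf_unlearned` are hypothetical confidence arrays.
import numpy as np
from scipy import stats

def ks_distribution_shift(conf_reference: np.ndarray,
                          conf_unlearned: np.ndarray) -> tuple[float, float]:
    """Return the KS statistic and p-value for the shift between two
    confidence-score distributions (larger statistic = larger shift)."""
    result = stats.ks_2samp(conf_reference, conf_unlearned)
    return result.statistic, result.pvalue

# Example with synthetic data: unlearning should push confidences down.
rng = np.random.default_rng(0)
conf_reference = rng.beta(8, 2, size=500)   # confident before unlearning
conf_unlearned = rng.beta(2, 5, size=500)   # less confident afterwards
stat, p = ks_distribution_shift(conf_reference, conf_unlearned)
print(f"KS statistic = {stat:.3f}, p = {p:.3g}")
```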
This list is automatically generated from the titles and abstracts of the papers on this site.