Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
- URL: http://arxiv.org/abs/2410.16834v2
- Date: Sat, 25 Jan 2025 14:42:57 GMT
- Title: Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
- Authors: Mingqi Gao, Xinyu Hu, Li Lin, Xiaojun Wan
- Abstract summary: This paper analyzes 12 common correlation measures using a large amount of real-world data from six widely-used NLG evaluation datasets and 32 evaluation metrics.
We find that the measure using global grouping and Pearson correlation coefficient exhibits the best performance in both discriminative power and ranking consistency.
- Score: 39.5140467914755
- Abstract: The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients result in various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics of and differences between these measures have not received sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely used NLG evaluation datasets and 32 evaluation metrics, revealing that different measures indeed impact the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation: discriminative power, ranking consistency, and sensitivity to score granularity. We find that the measure using global grouping and the Pearson correlation coefficient exhibits the best performance in both discriminative power and ranking consistency. In addition, the measures using system-level grouping or Kendall correlation are the least sensitive to score granularity.
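To make the grouping terminology concrete, here is a minimal Python sketch (not code from the paper) of how a global-grouping Pearson correlation and a system-level Kendall correlation are typically computed; the toy arrays, their shapes, and the random data are assumptions made purely for illustration.

```python
# Illustrative sketch only: two common correlation measures in NLG meta-evaluation.
# `human` and `metric` are hypothetical (n_systems, n_samples) score matrices.
import numpy as np
from scipy.stats import pearsonr, kendalltau

rng = np.random.default_rng(0)
human = rng.normal(size=(8, 100))                          # 8 systems x 100 samples (toy data)
metric = human + rng.normal(scale=0.5, size=human.shape)   # a noisy "automatic metric"

# Global grouping: pool all (metric, human) score pairs and correlate them directly.
global_pearson = pearsonr(metric.ravel(), human.ravel())[0]

# System-level grouping: average scores per system, then correlate the per-system means.
system_kendall = kendalltau(metric.mean(axis=1), human.mean(axis=1))[0]

print(f"global Pearson: {global_pearson:.3f}")
print(f"system-level Kendall: {system_kendall:.3f}")
```

Other measures in the same family swap the grouping (for example, correlating within each input and averaging) or the coefficient (Spearman instead of Pearson or Kendall); the exact definitions of the 12 measures analyzed should be taken from the paper itself.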
Related papers
- A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability [36.83105355430611]
We propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities.
We also introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations.
arXiv Detail & Related papers (2025-02-17T17:22:49Z)
- As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research [15.722009470067974]
We investigate how measurement impacts the outcomes of bias evaluations.
We show that bias evaluations are strongly influenced by the base metrics used to measure performance.
Based on our findings, we recommend the use of ratio-based bias measures.
arXiv Detail & Related papers (2024-08-24T16:04:51Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistically significant improvements over ROUGE, two recent metrics, QAEval and BERTScore, do in some evaluation settings (a resampling sketch follows this entry).
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
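As a rough illustration of the resampling idea described in the entry above, the following sketch (again, not the cited paper's code, and using assumed toy data) bootstraps inputs to obtain a confidence interval for a system-level Pearson correlation.

```python
# Illustrative sketch only: bootstrap confidence interval for a system-level Pearson correlation.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
human = rng.normal(size=(8, 100))                          # toy (n_systems, n_inputs) scores
metric = human + rng.normal(scale=0.5, size=human.shape)

def system_level_pearson(m, h):
    # Average per system, then correlate the per-system means.
    return pearsonr(m.mean(axis=1), h.mean(axis=1))[0]

n_inputs = human.shape[1]
boot = []
for _ in range(1000):
    idx = rng.integers(0, n_inputs, size=n_inputs)         # resample inputs with replacement
    boot.append(system_level_pearson(metric[:, idx], human[:, idx]))

low, high = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI: [{low:.3f}, {high:.3f}]")
```

Which units are resampled (inputs, systems, or both) changes what the interval captures; this example should be read only as a sketch of the general technique, not as the cited paper's procedure.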
This list is automatically generated from the titles and abstracts of the papers on this site.