The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems
- URL: http://arxiv.org/abs/2509.22064v1
- Date: Fri, 26 Sep 2025 08:49:03 GMT
- Title: The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems
- Authors: Anya Belz, Simon Mille, Craig Thomson,
- Abstract summary: Not knowing when two evaluations are comparable means we lack the ability to draw reliable conclusions about system quality. We present QCET, which derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP.
- Score: 11.876616474514828
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be misleading. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw reliable conclusions about system quality on the basis of multiple, independently conducted evaluations. This in turn hampers the ability of the field to progress scientifically as a whole, a pervasive issue in NLP since its beginning (Sparck Jones, 1981). It is hard to see how the issue of unclear comparability can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to, and grounded in. Taking a strictly descriptive approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP, and structures them into a hierarchy where each parent node captures common aspects of its child nodes. We present QCET and the resources it consists of, and discuss its three main uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulatory compliance.
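The abstract describes QCET as a hierarchy in which each parent node captures common aspects of its child nodes, and names establishing the comparability of existing evaluations as one of its main uses. As a rough illustration of that idea only, the sketch below builds a toy criterion hierarchy and finds the most specific shared ancestor of two mapped criteria; the node names, definitions, and the `mapping` dictionary are illustrative placeholders, not the actual QCET taxonomy or its mapping procedure.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class CriterionNode:
    """One node in a QCET-style hierarchy of quality criterion names/definitions."""
    name: str
    definition: str
    parent: Optional["CriterionNode"] = field(default=None, repr=False)
    children: list["CriterionNode"] = field(default_factory=list)

    def add_child(self, name: str, definition: str) -> "CriterionNode":
        child = CriterionNode(name, definition, parent=self)
        self.children.append(child)
        return child

    def ancestors(self) -> list["CriterionNode"]:
        """This node followed by its parents up to the root."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

def lowest_common_ancestor(a: CriterionNode, b: CriterionNode) -> CriterionNode:
    """Most specific criterion under which two mapped evaluations can be compared."""
    seen = set(a.ancestors())  # identity-based membership, since eq=False
    for node in b.ancestors():
        if node in seen:
            return node
    raise ValueError("criteria do not belong to the same taxonomy")

# Illustrative fragment only -- NOT the actual QCET nodes or definitions.
root = CriterionNode("quality", "any aspect of output quality")
form = root.add_child("quality of form", "properties of the output text in its own right")
fluency = form.add_child("fluency", "how well the text reads, independent of meaning")
grammaticality = form.add_child("grammaticality", "conformance to the rules of the language")

# Hypothetical mapping from criterion names found in papers to standard nodes.
mapping = {"Fluency": fluency, "Readability": fluency, "Grammaticality": grammaticality}

eval_a, eval_b = mapping["Readability"], mapping["Grammaticality"]
print(lowest_common_ancestor(eval_a, eval_b).name)  # -> "quality of form"
```

Under this reading, two evaluations are directly comparable only when their criterion names map to the same node; otherwise the lowest common ancestor names the broadest aspect of quality on which their results can legitimately be compared.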
Related papers
- The Validity of Coreference-based Evaluations of Natural Language Understanding [3.505146496638911]
I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions.
I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events.
arXiv Detail & Related papers (2026-02-18T05:49:28Z)
- DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey [53.85391477976017]
DeepSurvey-Bench is a novel benchmark designed to comprehensively evaluate the academic value of generated surveys.
We construct a reliable dataset with academic value annotations, and evaluate the deep academic value of the generated surveys.
arXiv Detail & Related papers (2026-01-13T14:42:56Z)
- Benchmark^2: Systematic Evaluation of LLM Benchmarks [66.2731798872668]
We propose Benchmark^2, a comprehensive framework comprising three complementary metrics.
We conduct experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains.
Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction can achieve comparable evaluation performance.
arXiv Detail & Related papers (2026-01-07T14:59:03Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences.
For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preferences.
HD-Eval inherits the essence of the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)
- KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- How to Evaluate Explainability? -- A Case for Three Criteria [0.0]
We will provide a multidisciplinary motivation for three quality criteria concerning the information that systems should provide.
Our aim is to fuel the discussion regarding these criteria so that adequate evaluation methods for them can be conceived.
arXiv Detail & Related papers (2022-09-01T11:22:50Z)
- A Meta Survey of Quality Evaluation Criteria in Explanation Methods [0.5801044612920815]
Explanation methods and their evaluation have become a significant issue in explainable artificial intelligence (XAI).
Since the most accurate AI models are opaque with low transparency and comprehensibility, explanations are essential for bias detection and control of uncertainty.
There are a plethora of criteria to choose from when evaluating explanation method quality.
arXiv Detail & Related papers (2022-03-25T22:24:21Z)
- Perturbation CheckLists for Evaluating NLG Evaluation Metrics [16.20764980129339]
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria.
Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated.
This suggests that the current recipe of proposing new automatic evaluation metrics for NLG is inadequate.
arXiv Detail & Related papers (2021-09-13T08:26:26Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- How Trustworthy are Performance Evaluations for Basic Vision Tasks? [46.0590176230731]
This paper examines performance evaluation criteria for basic vision tasks involving sets of objects, namely object detection, instance-level segmentation, and multi-object tracking.
The rankings of algorithms by an existing criterion can fluctuate with different choices of parameters, making their evaluations unreliable.
This work suggests a notion of trustworthiness for performance criteria, which requires (i) robustness to parameters for reliability, (ii) contextual meaningfulness in sanity tests, and (iii) consistency with mathematical requirements such as the metric properties.
arXiv Detail & Related papers (2020-08-08T14:21:15Z)
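The last entry above ties the trustworthiness of a performance criterion to, among other things, consistency with mathematical requirements such as the metric properties. Purely as an illustration of what such a sanity test could look like (not the paper's actual procedure), the sketch below empirically probes non-negativity, symmetry, identity, and the triangle inequality for a placeholder dissimilarity, here the Jaccard distance over small label sets.

```python
import itertools
import random

def jaccard_distance(a: set, b: set) -> float:
    """Placeholder dissimilarity: 1 - |A∩B| / |A∪B| (a true metric on finite sets)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def check_metric_properties(dist, samples, tol=1e-9) -> dict:
    """Empirically probe the metric axioms of `dist` on a list of sample inputs."""
    ok = {"non_negativity": True, "identity": True,
          "symmetry": True, "triangle_inequality": True}
    for x, y in itertools.product(samples, repeat=2):
        d = dist(x, y)
        ok["non_negativity"] = ok["non_negativity"] and d >= -tol
        ok["symmetry"] = ok["symmetry"] and abs(d - dist(y, x)) <= tol
        if x == y:
            # Only the d(x, x) == 0 direction of identity is probed here.
            ok["identity"] = ok["identity"] and abs(d) <= tol
    for x, y, z in itertools.product(samples, repeat=3):
        ok["triangle_inequality"] = (ok["triangle_inequality"]
                                     and dist(x, z) <= dist(x, y) + dist(y, z) + tol)
    return ok

# Random small label sets standing in for, e.g., predicted vs. reference annotations.
random.seed(0)
samples = [set(random.sample(range(10), k=random.randint(1, 5))) for _ in range(8)]
print(check_metric_properties(jaccard_distance, samples))
# Jaccard distance satisfies all four axioms, so every value should be True.
```

A criterion that fails one of these probes on even a handful of samples cannot satisfy the corresponding axiom in general, which is the kind of inconsistency the trustworthiness notion above is meant to surface.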