Evaluatology: The Science and Engineering of Evaluation
- URL: http://arxiv.org/abs/2404.00021v1
- Date: Tue, 19 Mar 2024 13:38:26 GMT
- Title: Evaluatology: The Science and Engineering of Evaluation
- Authors: Jianfeng Zhan, Lei Wang, Wanling Gao, Hongxiao Li, Chenxi Wang, Yunyou Huang, Yatao Li, Zhengxin Yang, Guoxin Kang, Chunjie Luo, Hainan Ye, Shaopeng Dai, Zhifei Zhang
- Abstract summary: This article aims to formally introduce the discipline of evaluatology, which encompasses the science and engineering of evaluation.
We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines.
- Score: 11.997673313601423
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Evaluation is a crucial aspect of human existence and plays a vital role in various fields. However, it is often approached in an empirical and ad-hoc manner, lacking consensus on universal concepts, terminologies, theories, and methodologies. This lack of agreement has significant repercussions. This article aims to formally introduce the discipline of evaluatology, which encompasses the science and engineering of evaluation. We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines. Our research reveals that the essence of evaluation lies in conducting experiments that intentionally apply a well-defined evaluation condition to diverse subjects and infer the impact of different subjects by measuring and/or testing. Derived from the essence of evaluation, we propose five axioms focusing on key aspects of evaluation outcomes as the foundational evaluation theory. These axioms serve as the bedrock upon which we build universal evaluation theories and methodologies. When evaluating a single subject, it is crucial to create evaluation conditions with different levels of equivalency. By applying these conditions to diverse subjects, we can establish reference evaluation models. These models allow us to alter a single independent variable at a time while keeping all other variables as controls. When evaluating complex scenarios, the key lies in establishing a series of evaluation models that maintain transitivity. Building upon the science of evaluation, we propose a formal definition of a benchmark as a simplified and sampled evaluation condition that guarantees different levels of equivalency. This concept serves as the cornerstone for a universal benchmark-based engineering approach to evaluation across various disciplines, which we refer to as benchmarkology.
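As a rough illustration of the controlled-experiment idea in the abstract (apply one well-defined evaluation condition to diverse subjects, altering only the subject while all other variables are held as controls), here is a minimal sketch. It is not code from the paper; the `EvaluationCondition` and `evaluate` names and the toy subjects are assumptions made purely for illustration.

```python
# Illustrative sketch only: a controlled evaluation in the spirit of the
# abstract's "alter a single independent variable at a time while keeping
# all other variables as controls". Names are hypothetical, not from the paper.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class EvaluationCondition:
    """A fixed, well-defined condition shared by every subject under test."""
    problems: Tuple[Tuple[float, float], ...]   # inputs paired with ground truth
    metric: Callable[[float, float], float]     # scoring rule, e.g. absolute error


def evaluate(subject: Callable[[float], float],
             condition: EvaluationCondition) -> float:
    """Apply the same condition to one subject and return its mean score."""
    scores = [condition.metric(subject(x), y) for x, y in condition.problems]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # The evaluation condition is held constant (the control variables);
    # only the subject (the single independent variable) changes.
    condition = EvaluationCondition(
        problems=((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)),
        metric=lambda pred, truth: abs(pred - truth),
    )
    subjects: Dict[str, Callable[[float], float]] = {
        "doubling_model": lambda x: 2.0 * x,
        "identity_model": lambda x: x,
    }
    for name, subject in subjects.items():
        print(name, evaluate(subject, condition))
```

Because the condition is identical across runs, any difference in the printed scores can be attributed to the subject alone, which is the point the abstract makes about reference evaluation models.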
Related papers
- Good Idea or Not, Representation of LLM Could Tell [86.36317971482755]
We focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas.
We release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task.
Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs.
arXiv Detail & Related papers (2024-09-07T02:07:22Z)
- StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation [46.59416831869014]
We propose a novel evaluation framework referred to as StructEval.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts.
Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination.
arXiv Detail & Related papers (2024-08-06T16:28:30Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can cause fluent information to be conflated with truthful information, and how cognitive uncertainty affects the reliability of ratings on scales such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Multi-Criteria Comparison as a Method of Advancing Knowledge-Guided Machine Learning [1.6574413179773761]
This paper describes a generalizable model evaluation method that can be adapted to evaluate AI/ML models.
The method evaluates a group of candidate models of varying type and structure across multiple scientific, theoretic, and practical criteria.
arXiv Detail & Related papers (2024-03-18T14:50:48Z)
- Evaluation in Neural Style Transfer: A Review [0.7614628596146599]
We provide an in-depth analysis of existing evaluation techniques, identify the inconsistencies and limitations of current evaluation methods, and give recommendations for standardized evaluation practices.
We believe that the development of a robust evaluation framework will not only enable more meaningful and fairer comparisons but will also enhance the comprehension and interpretation of research findings in the field.
arXiv Detail & Related papers (2024-01-30T15:45:30Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Evaluating Agents using Social Choice Theory [21.26784305333596]
We argue that many general evaluation problems can be viewed through the lens of voting theory.
Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation.
These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation (a minimal rank-aggregation sketch follows this list).
arXiv Detail & Related papers (2023-12-05T20:40:37Z)
- KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Evaluation Gaps in Machine Learning Practice [13.963766987258161]
In practice, evaluations of machine learning models frequently focus on a narrow range of decontextualized predictive behaviours.
We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations.
By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts.
arXiv Detail & Related papers (2022-05-11T04:00:44Z)
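The social-choice entry above ("Evaluating Agents using Social Choice Theory") treats each task as a voter that supplies only an ordinal ranking of agents. The sketch below shows one simple aggregation rule, a Borda count; this is merely an illustrative voting rule under assumed inputs, not necessarily the method that paper advocates, and the agent and task names are hypothetical.

```python
# Illustrative sketch only, not code from the cited paper: each task acts as a
# voter that ranks agents ordinally, and a Borda count aggregates the per-task
# rankings into a single overall evaluation.
from typing import Dict, List


def borda_aggregate(task_rankings: List[List[str]]) -> Dict[str, int]:
    """Aggregate per-task ordinal rankings (best first) into Borda scores."""
    scores: Dict[str, int] = {}
    for ranking in task_rankings:
        n = len(ranking)
        for position, agent in enumerate(ranking):
            # An agent ranked first among n agents earns n - 1 points, last earns 0.
            scores[agent] = scores.get(agent, 0) + (n - 1 - position)
    return scores


if __name__ == "__main__":
    # Three hypothetical tasks, each providing only an ordinal ranking of agents.
    rankings = [
        ["agent_a", "agent_b", "agent_c"],
        ["agent_b", "agent_a", "agent_c"],
        ["agent_a", "agent_c", "agent_b"],
    ]
    print(borda_aggregate(rankings))  # {'agent_a': 5, 'agent_b': 3, 'agent_c': 1}
```

Because the aggregation uses only rankings, it needs no cross-task score calibration, which is the flexibility the entry above highlights.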
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.