Hierarchical Evaluation Framework: Best Practices for Human Evaluation
- URL: http://arxiv.org/abs/2310.01917v2
- Date: Thu, 12 Oct 2023 07:59:56 GMT
- Title: Hierarchical Evaluation Framework: Best Practices for Human Evaluation
- Authors: Iva Bojic, Jessica Chen, Si Yuan Chang, Qi Chwen Ong, Shafiq Joty,
Josip Car
- Abstract summary: The absence of widely accepted human evaluation metrics in NLP hampers fair comparisons among different systems and the establishment of universal assessment standards.
We develop our own hierarchical evaluation framework to provide a more comprehensive representation of the NLP system's performance.
In future work, we will investigate the potential time-saving benefits of our proposed framework for evaluators assessing NLP systems.
- Score: 17.91641890651225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human evaluation plays a crucial role in Natural Language Processing (NLP) as
it assesses the quality and relevance of developed systems, thereby
facilitating their enhancement. However, the absence of widely accepted human
evaluation metrics in NLP hampers fair comparisons among different systems and
the establishment of universal assessment standards. Through an extensive
analysis of existing literature on human evaluation metrics, we identified
several gaps in NLP evaluation methodologies. These gaps served as motivation
for developing our own hierarchical evaluation framework. The proposed
framework offers notable advantages, particularly in providing a more
comprehensive representation of the NLP system's performance. We applied this
framework to evaluate the developed Machine Reading Comprehension system, which
was utilized within a human-AI symbiosis model. The results highlighted the
associations between the quality of inputs and outputs, underscoring the
necessity to evaluate both components rather than solely focusing on outputs.
In future work, we will investigate the potential time-saving benefits of our
proposed framework for evaluators assessing NLP systems.
Related papers
- Unveiling Context-Aware Criteria in Self-Assessing LLMs [28.156979106994537]
We propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance.
Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks.
Our method also yields up to a 12% improvement in LC Win-Rate on the AlpacaEval2 leaderboard when employed for preference data generation.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
- Evaluating Human-AI Collaboration: A Review and Methodological Framework [4.41358655687435]
The use of artificial intelligence (AI) alongside individuals in working environments, known as Human-AI Collaboration (HAIC), has become essential.
However, evaluating HAIC's effectiveness remains challenging due to the complex interaction of the components involved.
This paper provides a detailed analysis of existing HAIC evaluation approaches and develops a fresh paradigm for more effectively evaluating these systems.
arXiv Detail & Related papers (2024-07-09T12:52:22Z)
- Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition [70.60872754129832]
The first NeurIPS competition on unlearning sought to stimulate the development of novel unlearning algorithms.
Nearly 1,200 teams from across the world participated.
We analyze top solutions and delve into discussions on benchmarking unlearning.
arXiv Detail & Related papers (2024-06-13T12:58:00Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can lead evaluators to conflate fluency with truthfulness, and how cognitive uncertainty affects the reliability of rating scales such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities in assessing the quality of generated natural language.
However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
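The PairS entry above describes ranking candidate texts through pairwise LLM comparisons. Below is a minimal sketch of that general idea, not the paper's exact uncertainty-guided algorithm: a merge-sort-style ranker built on an assumed judge callback `prefers(a, b)` (a hypothetical name) that would typically be backed by an LLM prompt.

```python
# Minimal sketch of pairwise-preference ranking (merge-sort style).
# `prefers(a, b)` is a hypothetical judge callback, e.g. an LLM prompt
# that answers whether text `a` is better than text `b`. This illustrates
# the general technique, not the PairS algorithm itself.
from typing import Callable, List

def rank_by_pairwise_preference(
    candidates: List[str],
    prefers: Callable[[str, str], bool],
) -> List[str]:
    """Return candidates ordered best-first using only pairwise judgements."""
    if len(candidates) <= 1:
        return list(candidates)
    mid = len(candidates) // 2
    left = rank_by_pairwise_preference(candidates[:mid], prefers)
    right = rank_by_pairwise_preference(candidates[mid:], prefers)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if prefers(left[i], right[j]):  # judge decides which text ranks higher
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

# Toy judge for demonstration only: prefer the longer text.
print(rank_by_pairwise_preference(
    ["ok", "a short answer", "a somewhat longer answer"],
    lambda a, b: len(a) > len(b),
))
```

The merge-sort structure keeps the number of comparisons near O(n log n), which matters when every comparison is a model call.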
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z)
- Unveiling Bias in Fairness Evaluations of Large Language Models: A Critical Literature Review of Music and Movie Recommendation Systems [0.0]
The rise of generative artificial intelligence, particularly Large Language Models (LLMs), has intensified the imperative to scrutinize fairness alongside accuracy.
Recent studies have begun to investigate fairness evaluations for LLMs within domains such as recommendations.
Yet, the degree to which current fairness evaluation frameworks account for personalization remains unclear.
arXiv Detail & Related papers (2024-01-08T17:57:29Z)
- Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or are of insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
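The RoSE entry above scores summaries against fine-grained Atomic Content Units. As a rough, hedged illustration of ACU-style scoring (the function name and example judgements are invented here, and the benchmark's exact normalization may differ), a summary's score can be taken as the fraction of reference ACUs an annotator judges it to contain:

```python
# Sketch of an ACU-style salience score: the fraction of reference
# Atomic Content Units judged present in a candidate summary.
# Names and example judgements are illustrative, not taken from RoSE.
from typing import Dict

def acu_score(acu_present: Dict[str, bool]) -> float:
    """Map {ACU statement: judged present?} to a score in [0, 1]."""
    if not acu_present:
        return 0.0
    return sum(acu_present.values()) / len(acu_present)

judgements = {
    "The hurricane made landfall on Tuesday.": True,
    "Thousands of residents were evacuated.": True,
    "Power was restored within two days.": False,
}
print(f"{acu_score(judgements):.2f}")  # 0.67
```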
- Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields.
arXiv Detail & Related papers (2022-10-11T20:19:11Z)
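Vote'n'Rank, summarized above, aggregates multi-task benchmark results with social-choice rules. The sketch below shows one classic rule of that family, a Borda count over per-task rankings; the function name and toy scores are illustrative and not taken from the paper.

```python
# Sketch of a Borda-count aggregation over per-task benchmark scores.
# Each task contributes a ranking; a system earns (n - 1 - rank) points
# per task, and systems are ordered by total points. Illustrative only.
from typing import Dict, List

def borda_ranking(task_scores: Dict[str, Dict[str, float]]) -> List[str]:
    """task_scores: task -> {system: score}, higher scores are better."""
    systems = sorted({s for per_task in task_scores.values() for s in per_task})
    points = {s: 0 for s in systems}
    for per_task in task_scores.values():
        ordered = sorted(per_task, key=per_task.get, reverse=True)
        for rank, system in enumerate(ordered):
            points[system] += len(ordered) - 1 - rank
    return sorted(systems, key=lambda s: points[s], reverse=True)

toy_scores = {
    "task_a": {"sys1": 0.91, "sys2": 0.88, "sys3": 0.80},
    "task_b": {"sys1": 0.55, "sys2": 0.70, "sys3": 0.65},
}
print(borda_ranking(toy_scores))  # ['sys2', 'sys1', 'sys3']
```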
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)