The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
- URL: http://arxiv.org/abs/2103.09710v1
- Date: Wed, 17 Mar 2021 15:08:50 GMT
- Title: The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
- Authors: Anastasia Shimorina and Anya Belz
- Abstract summary: The Human Evaluation Datasheet is a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).
The Human Evaluation Datasheet is intended to facilitate the recording of properties of human evaluations in sufficient detail.
- Score: 1.4467794332678539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces the Human Evaluation Datasheet, a template for
recording the details of individual human evaluation experiments in Natural
Language Processing (NLP). Originally taking inspiration from seminal papers by
Bender and Friedman (2018), Mitchell et al. (2019), and Gebru et al. (2020),
the Human Evaluation Datasheet is intended to facilitate the recording of
properties of human evaluations in sufficient detail, and with sufficient
standardisation, to support comparability, meta-evaluation, and reproducibility
tests.
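To make the idea of recording an experiment's properties concrete, here is a minimal sketch of how one human evaluation could be captured as structured data. This is an illustration under assumed field names, not the HED's actual sections, question numbering, or wording, which take the form of a questionnaire rather than a code schema.

```python
# Illustrative sketch only: field names are hypothetical, not the HED's own questions.
from dataclasses import dataclass
from typing import List

@dataclass
class HumanEvaluationRecord:
    paper_reference: str            # where the evaluation is reported
    systems: List[str]              # systems whose outputs were evaluated
    quality_criteria: List[str]     # e.g. fluency, adequacy
    evaluator_recruitment: str      # e.g. crowdsourcing, in-lab experts
    sample_size: int                # number of evaluated outputs
    response_elicitation: str       # e.g. 7-point rating scale
    ethics_notes: str = ""          # approvals, consent, compensation

# Example record for a hypothetical experiment.
record = HumanEvaluationRecord(
    paper_reference="Example et al. (2021)",
    systems=["neural data-to-text generator"],
    quality_criteria=["fluency", "semantic adequacy"],
    evaluator_recruitment="crowdsourced, 3 ratings per item",
    sample_size=100,
    response_elicitation="7-point rating scale",
)
print(record.quality_criteria)
```

Recording experiments in a uniform structure like this is what makes the comparability, meta-evaluation, and reproducibility tests mentioned in the abstract feasible.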
Related papers
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments.
We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data.
We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z)
- It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation [15.8765167340819]
Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment.
Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations.
This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem.
arXiv Detail & Related papers (2023-09-30T20:54:59Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation [176.56131810249602]
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source.
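As a rough illustration of the scoring idea described above, the sketch below computes the fraction of atomic facts judged supported. The function name, the fact list, and the support checker are assumptions for illustration, not the released FActScore implementation.

```python
from typing import Callable, List

def atomic_fact_precision(atomic_facts: List[str],
                          is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts the checker judges supported.

    In the paper, facts are extracted from a generation and checked
    against a reliable knowledge source; here the checker is a stand-in.
    """
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)

# Toy usage with a hard-coded stand-in "knowledge source".
known_true = {"Marie Curie was born in Warsaw."}
facts = ["Marie Curie was born in Warsaw.", "She won three Nobel Prizes."]
print(atomic_fact_precision(facts, lambda f: f in known_true))  # 0.5
```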
arXiv Detail & Related papers (2023-05-23T17:06:00Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction and (ii) enough obtainable information to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation [35.8129864412223]
This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing human evaluation experiments reliably and conclusively.
arXiv Detail & Related papers (2023-04-04T14:14:16Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- A Review of Human Evaluation for Style Transfer [12.641094377317904]
This paper reviews and summarizes human evaluation practices described in 97 style transfer papers.
We find that protocols for human evaluations are often underspecified and not standardized.
arXiv Detail & Related papers (2021-06-09T00:29:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.