The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
- URL: http://arxiv.org/abs/2103.09710v1
- Date: Wed, 17 Mar 2021 15:08:50 GMT
- Title: The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
- Authors: Anastasia Shimorina and Anya Belz
- Abstract summary: The Human Evaluation Datasheet is a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).
The Human Evaluation Datasheet is intended to facilitate the recording of properties of human evaluations in sufficient detail.
- Score: 1.4467794332678539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces the Human Evaluation Datasheet, a template for
recording the details of individual human evaluation experiments in Natural
Language Processing (NLP). Originally taking inspiration from seminal papers by
Bender and Friedman (2018), Mitchell et al. (2019), and Gebru et al. (2020),
the Human Evaluation Datasheet is intended to facilitate the recording of
properties of human evaluations in sufficient detail, and with sufficient
standardisation, to support comparability, meta-evaluation, and reproducibility
tests.
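As a rough illustration of what recording an experiment's properties "in sufficient detail, and with sufficient standardisation" could look like in machine-readable form, here is a minimal, hypothetical Python sketch. The field names and example values are invented for this illustration and are not the actual HEDS questions or sections.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HumanEvaluationRecord:
    """Illustrative (unofficial) record of one human evaluation experiment."""
    evaluated_system: str        # system whose outputs were evaluated
    output_sample_size: int      # number of outputs shown to evaluators
    quality_criteria: List[str]  # e.g. "fluency", "adequacy" (example names)
    evaluator_type: str          # e.g. "crowdworkers" or "domain experts"
    num_evaluators: int          # how many evaluators took part
    responses_per_item: int      # judgments collected per output
    response_elicitation: str    # e.g. "5-point Likert scale"

# Hypothetical example record; all values are made up.
record = HumanEvaluationRecord(
    evaluated_system="neural data-to-text generator",
    output_sample_size=100,
    quality_criteria=["fluency", "adequacy"],
    evaluator_type="crowdworkers",
    num_evaluators=12,
    responses_per_item=3,
    response_elicitation="5-point Likert scale",
)
print(record)
```

Recording such properties in a uniform structure is what supports the comparability and reproducibility tests the abstract mentions: two experiments can be compared field by field rather than by re-reading prose descriptions.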
Related papers
- It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation [15.8765167340819]
Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment.
Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations.
This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem.
arXiv Detail & Related papers (2023-09-30T20:54:59Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation [176.56131810249602]
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source; a minimal sketch of this computation is given after this list.
arXiv Detail & Related papers (2023-05-23T17:06:00Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation [35.8129864412223]
This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing human evaluation experiments reliably and conclusively.
arXiv Detail & Related papers (2023-04-04T14:14:16Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- A Review of Human Evaluation for Style Transfer [12.641094377317904]
This paper reviews and summarizes human evaluation practices described in 97 style transfer papers.
We find that protocols for human evaluations are often underspecified and not standardized.
arXiv Detail & Related papers (2021-06-09T00:29:42Z)
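To make the FActScore entry above concrete, the minimal Python sketch referenced there is given below: it computes the percentage of atomic facts supported by a knowledge source. The `is_supported` predicate and the toy knowledge set are placeholders; the actual FActScore pipeline verifies each atomic fact against a reliable knowledge source (e.g. via retrieval and an LM verifier), which is not reproduced here.

```python
from typing import Callable, Iterable

def factscore(atomic_facts: Iterable[str],
              is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts judged supported by a knowledge source.

    `is_supported` stands in for FActScore's retrieval-and-verification
    step; here it can be any boolean predicate.
    """
    facts = list(atomic_facts)
    if not facts:
        return 0.0
    return sum(1 for fact in facts if is_supported(fact)) / len(facts)

# Toy usage with a placeholder "knowledge source" (a set of known-true strings).
knowledge = {"Marie Curie won two Nobel Prizes."}
facts = [
    "Marie Curie won two Nobel Prizes.",
    "Marie Curie was born in Vienna.",
]
print(factscore(facts, lambda f: f in knowledge))  # 0.5
```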
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.