A Review of Human Evaluation for Style Transfer
- URL: http://arxiv.org/abs/2106.04747v1
- Date: Wed, 9 Jun 2021 00:29:42 GMT
- Title: A Review of Human Evaluation for Style Transfer
- Authors: Eleftheria Briakou, Sweta Agrawal, Ke Zhang, Joel Tetreault and Marine Carpuat
- Abstract summary: This paper reviews and summarizes human evaluation practices described in 97 style transfer papers.
We find that protocols for human evaluations are often underspecified and not standardized.
- Score: 12.641094377317904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper reviews and summarizes human evaluation practices described in 97
style transfer papers with respect to three main evaluation aspects: style
transfer, meaning preservation, and fluency. In principle, evaluations by human
raters should be the most reliable. However, in style transfer papers, we find
that protocols for human evaluations are often underspecified and not
standardized, which hampers the reproducibility of research in this field and
progress toward better human and automatic evaluation methods.
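To make the reported gap concrete, the sketch below lists the kind of details a fully specified human evaluation protocol for style transfer could record. It is an illustrative example only; the field names and placeholder values are assumptions for illustration, not recommendations taken from the paper.
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HumanEvalProtocol:
    """Illustrative record of details a human evaluation could report.

    These are the kinds of choices that, per the review, papers often
    leave unreported; the concrete defaults below are placeholders.
    """
    aspects: List[str] = field(
        default_factory=lambda: ["style transfer", "meaning preservation", "fluency"]
    )
    rating_scale: str = "5-point Likert (1 = worst, 5 = best)"
    rater_background: str = "crowdworkers, native speakers of the target language"
    raters_per_item: int = 3
    items_per_system: int = 100
    instructions_released: bool = True      # are the exact rater instructions published?
    quality_control: str = "attention checks and gold items"
    agreement_statistic: str = "inter-annotator agreement reported per aspect"
    aggregation: str = "median rating per item, then mean over items"


if __name__ == "__main__":
    # Printing the record shows every protocol choice in one place.
    print(HumanEvalProtocol())
```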
Related papers
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can lead raters to conflate fluency with truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert scales.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric, Preference Score (PS), that fits human preferences based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation [35.8129864412223]
This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing reliable and conclusive human evaluation experiments.
arXiv Detail & Related papers (2023-04-04T14:14:16Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer [13.886432536330807]
We focus on the task of formality transfer, and on the three aspects that are usually evaluated: style strength, content preservation, and fluency.
We offer some recommendations on the use of such metrics in formality transfer, also with an eye to their generalisability (or not) to related tasks; a minimal correlation check in this spirit is sketched after this list.
arXiv Detail & Related papers (2022-04-15T17:15:52Z)
- Counterfactually Evaluating Explanations in Recommender Systems [14.938252589829673]
We propose an offline evaluation method that can be computed without human involvement.
We show that, compared to conventional methods, our method produces evaluation scores that correlate better with real human judgments.
arXiv Detail & Related papers (2022-03-02T18:55:29Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework for human evaluation to assess multiple labelling strategies and methods for deciding which model is better, in both a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text [46.260544251940125]
We run a study assessing non-experts' ability to distinguish between human- and machine-authored text.
We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level.
arXiv Detail & Related papers (2021-06-30T19:00:25Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
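Several of the entries above (for instance, the formality transfer and ENIGMA papers) judge automatic metrics by how well they correlate with human evaluation scores. As a purely illustrative sketch, the snippet below runs a segment-level Spearman correlation between hypothetical metric scores and human ratings; the numbers are placeholders and do not come from any of the papers listed.
```python
from scipy.stats import spearmanr

# Hypothetical per-item scores for one evaluation aspect (e.g., style strength).
# Both lists are made-up placeholders, not data from any paper above.
human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5, 2.0, 3.5]
metric_scores = [0.82, 0.40, 0.55, 0.90, 0.20, 0.75, 0.35, 0.60]

# Spearman's rho measures how well the metric preserves the human ranking.
rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```
The same check is often repeated at the system level (averaging scores per system before correlating), which can give a very different picture from the segment-level correlation shown here.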