Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on
Recent Papers
- URL: http://arxiv.org/abs/2108.00308v1
- Date: Sat, 31 Jul 2021 18:54:30 GMT
- Title: Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on
Recent Papers
- Authors: Mika Hämäläinen and Khalid Alnajjar
- Abstract summary: We survey human evaluation in papers presenting work on creative natural language generation.
The most typical human evaluation method is a scaled survey, typically on a 5 point scale.
The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value.
- Score: 0.685316573653194
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We survey human evaluation in papers presenting work on creative natural
language generation that have been published in INLG 2020 and ICCC 2020. The
most typical human evaluation method is a scaled survey, typically on a 5 point
scale, while many other less common methods exist. The most commonly evaluated
parameters are meaning, syntactic correctness, novelty, relevance and emotional
value, among many others. Our guidelines for future evaluation include clearly
defining the goal of the generative system, asking questions as concrete as
possible, testing the evaluation setup, using multiple different evaluation
setups, reporting the entire evaluation process and potential biases clearly,
and finally analyzing the evaluation results in a more profound way than merely
reporting the most typical statistics.
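As a rough illustration of the last guideline, here is a minimal sketch (using hypothetical 5-point ratings, not data from the survey) that reports the rating distribution and median alongside the mean and compares two systems with a rank-based test; numpy and scipy are assumed to be available.

```python
# Minimal sketch: analyzing hypothetical 5-point human evaluation ratings
# beyond the mean, in the spirit of the survey's last guideline.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical "novelty" ratings (1-5) from 20 annotators per system.
system_a = rng.integers(1, 6, size=20)
system_b = rng.integers(2, 6, size=20)

for name, ratings in (("A", system_a), ("B", system_b)):
    counts = np.bincount(ratings, minlength=6)[1:]  # counts for points 1..5
    print(f"System {name}: mean={ratings.mean():.2f}, "
          f"median={np.median(ratings):.1f}, distribution={counts.tolist()}")

# Ordinal Likert-style ratings are better compared with a rank-based test
# than with a t-test on the means.
stat, p = mannwhitneyu(system_a, system_b, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```

When several annotators rate the same outputs, inter-annotator agreement (e.g., Krippendorff's alpha for ordinal data) would be a natural addition to such an analysis.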
Related papers
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric, Preference Score (PS), that fits human preferences based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image captioning and text summarization tasks. (A minimal sketch of the underlying WEAT statistic appears after this list.)
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- How to Evaluate Your Dialogue Models: A Review of Approaches [2.7834038784275403]
We are the first to divide the evaluation methods into three classes: automatic evaluation, human-involved evaluation, and user-simulator-based evaluation.
Existing benchmarks suitable for evaluating dialogue techniques are also discussed in detail.
arXiv Detail & Related papers (2021-08-03T08:52:33Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
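For the social-bias entry above, the following is a minimal sketch of the WEAT effect size (Caliskan et al., 2017) that WEAT-based bias tests build on; the word sets and 4-dimensional embeddings are hypothetical toy data, not that paper's actual setup.

```python
# Minimal sketch of the WEAT effect size with toy, hypothetical embeddings.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity with attribute set A minus set B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # d = (mean_x s(x,A,B) - mean_y s(y,A,B)) / std_{w in X ∪ Y} s(w,A,B)
    # (sample standard deviation here; implementations vary on the ddof choice)
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))  # target words, e.g. one set of names (toy vectors)
Y = rng.normal(size=(8, 4))  # target words, e.g. another set of names
A = rng.normal(size=(8, 4))  # attribute words, e.g. career terms
B = rng.normal(size=(8, 4))  # attribute words, e.g. family terms
print(f"WEAT effect size: {weat_effect_size(X, Y, A, B):.3f}")
```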
This list is automatically generated from the titles and abstracts of the papers on this site.