The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony
and Sarcasm Generation
- URL: http://arxiv.org/abs/2311.05552v1
- Date: Thu, 9 Nov 2023 17:50:23 GMT
- Title: The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony
and Sarcasm Generation
- Authors: Tyler Loakman, Aaron Maladry, Chenghua Lin
- Abstract summary: We argue that the generation of more esoteric forms of language constitutes a subdomain where the characteristics of selected evaluator panels are of utmost importance.
We perform a critical survey of recent works in NLG to assess how well evaluation procedures are reported in this subdomain.
We note a severe lack of open reporting of evaluator demographic information, and a significant reliance on crowdsourcing platforms for recruitment.
- Score: 16.591822946975547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation is often considered to be the gold standard method of
evaluating a Natural Language Generation system. However, whilst its importance
is accepted by the community at large, the quality of its execution is often
brought into question. In this position paper, we argue that the generation of
more esoteric forms of language - humour, irony and sarcasm - constitutes a
subdomain where the characteristics of selected evaluator panels are of utmost
importance, and every effort should be made to report demographic
characteristics wherever possible, in the interest of transparency and
replicability. We support these claims with an overview of each language form
and an analysis of examples in terms of how their interpretation is affected by
different participant variables. We additionally perform a critical survey of
recent works in NLG to assess how well evaluation procedures are reported in
this subdomain, and note a severe lack of open reporting of evaluator
demographic information, and a significant reliance on crowdsourcing platforms
for recruitment.
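As an illustration of the kind of transparency the abstract argues for, below is a minimal sketch (not from the paper; all field names and values are hypothetical) of how evaluator demographic information could be summarised and reported alongside humour ratings.

```python
# Minimal sketch: summarising evaluator demographics alongside humour ratings
# so they can be openly reported with the results. All fields/values are hypothetical.
from collections import Counter
from statistics import mean

evaluators = [
    {"id": "e1", "age_band": "25-34", "first_language": "English", "country": "UK"},
    {"id": "e2", "age_band": "35-44", "first_language": "Dutch",   "country": "BE"},
    {"id": "e3", "age_band": "25-34", "first_language": "English", "country": "US"},
]
ratings = [  # one funniness rating (1-5) per evaluator per generated joke
    {"evaluator": "e1", "item": "joke_01", "funniness": 4},
    {"evaluator": "e2", "item": "joke_01", "funniness": 2},
    {"evaluator": "e3", "item": "joke_01", "funniness": 5},
]

def demographic_summary(evaluators, keys=("age_band", "first_language", "country")):
    """Count evaluators per demographic category, for open reporting."""
    return {key: Counter(e[key] for e in evaluators) for key in keys}

print(demographic_summary(evaluators))
print("mean funniness:", mean(r["funniness"] for r in ratings))
```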
Related papers
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Towards Geographic Inclusion in the Evaluation of Text-to-Image Models [25.780536950323683]
We study how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images.
For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative.
We recommend steps for improved automatic and human evaluations.
arXiv Detail & Related papers (2024-05-07T16:23:06Z)
- Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z)
- A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
arXiv Detail & Related papers (2024-02-20T11:28:50Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
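The summary above describes feeding Likert ratings and their explanations to an LLM together with a scoring rubric. A rough sketch of that recipe is given below; the rubric wording, prompt, and the `call_llm` helper are placeholders, not the authors' implementation.

```python
# Rough sketch of LLM-based rescaling of Likert judgments (not the authors' code).
# `call_llm` stands in for whatever LLM completion function is available.
import re

RUBRIC = (
    "Score the response on a 0-100 scale: 0-20 unacceptable, 21-40 poor, "
    "41-60 mixed, 61-80 good, 81-100 excellent."
)

def build_prompt(likert_rating: int, explanation: str) -> str:
    """Combine an annotator's 1-5 rating and free-text explanation with the rubric."""
    return (
        f"{RUBRIC}\n\n"
        f"An annotator rated this response {likert_rating}/5 and explained: "
        f'"{explanation}"\n'
        "Output a single integer score between 0 and 100, anchored in the rubric."
    )

def rescale(likert_rating: int, explanation: str, call_llm) -> int:
    """Send the prompt to an LLM and parse the first integer in its reply."""
    reply = call_llm(build_prompt(likert_rating, explanation))
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else likert_rating * 20  # crude fallback
```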
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on the Word Embedding Association Test (WEAT) and the Sentence Encoder Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
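For reference, the sketch below shows the standard WEAT effect-size computation the summary refers to, applied to generic embedding vectors; it is illustrative only and does not reproduce the paper's metric-level setup or datasets.

```python
# Sketch of the standard WEAT effect size on embedding vectors (illustrative only).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean cosine similarity to attribute set A minus attribute set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Effect size d over target sets X, Y and attribute sets A, B (lists of vectors)."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)

# Demo on random vectors: the effect size should be close to zero.
rng = np.random.default_rng(0)
X, Y, A, B = (list(rng.normal(size=(4, 50))) for _ in range(4))
print(f"effect size on random vectors: {weat_effect_size(X, Y, A, B):.2f}")
```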
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- Evaluation of Summarization Systems across Gender, Age, and Race [0.0]
We show that summary evaluation is sensitive to protected attributes.
This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.
arXiv Detail & Related papers (2021-10-08T21:30:20Z)
- Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers [0.685316573653194]
We survey human evaluation in papers presenting work on creative natural language generation.
The most typical human evaluation method is a scaled survey, usually on a 5-point scale.
The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value.
arXiv Detail & Related papers (2021-07-31T18:54:30Z)
- On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems [17.749995931459136]
We propose that a metric based on linguistic features may be able to maintain good correlation with human judgment and be interpretable.
To support this proposition, we measure and analyze various linguistic features on dialogues produced by multiple dialogue models.
We find that the features' behaviour is consistent with the known properties of the models tested, and is similar across domains.
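As a toy illustration of the recipe sketched above, the snippet below computes two simple surface-level linguistic features for dialogue responses and checks their Spearman correlation with human scores; the specific features, responses, and scores are hypothetical, not those used in the paper.

```python
# Toy sketch: correlate simple linguistic features with human judgments
# (features, responses, and scores here are illustrative only).
from scipy.stats import spearmanr

def features(response: str) -> dict:
    tokens = response.lower().split()
    return {
        "length": len(tokens),  # utterance length in tokens
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),  # lexical diversity
    }

responses = [
    "i do not know",
    "that sounds like a lovely plan , tell me more about it",
    "ha ha ha",
]
human_scores = [2.0, 4.5, 1.0]  # hypothetical averaged human quality ratings

for name in ("length", "type_token_ratio"):
    values = [features(r)[name] for r in responses]
    rho, p = spearmanr(values, human_scores)
    print(f"{name}: Spearman rho={rho:.2f} (p={p:.2f})")
```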
arXiv Detail & Related papers (2021-04-13T16:28:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.