Evaluation of Summarization Systems across Gender, Age, and Race
- URL: http://arxiv.org/abs/2110.04384v1
- Date: Fri, 8 Oct 2021 21:30:20 GMT
- Title: Evaluation of Summarization Systems across Gender, Age, and Race
- Authors: Anna Jørgensen and Anders Søgaard
- Abstract summary: We show that summary evaluation is sensitive to protected attributes.
This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Summarization systems are ultimately evaluated by human annotators and raters. Usually, annotators and raters do not reflect the demographics of end users, but are recruited through student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios -- evaluation against gold summaries and system output ratings -- we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.
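The finding suggests a simple sanity check one can run: score each system against gold summaries produced by annotators from different demographic groups and see whether the system ranking changes. The sketch below is illustrative only; the group labels, summaries, and the unigram-F1 stand-in for ROUGE are assumptions, not the paper's actual data or metric.

```python
# Minimal sketch (not the authors' code): check whether the choice of
# annotator group changes how systems are scored against gold summaries.
# Group labels, summaries, and the unigram-F1 stand-in metric are all
# illustrative assumptions.
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Crude stand-in for ROUGE: token-level F1 between two strings."""
    pred, ref = Counter(prediction.lower().split()), Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold summaries written by annotators from different groups.
gold_by_group = {
    "group_a": ["the council approved the new budget on monday"],
    "group_b": ["city council passes budget after monday vote"],
}
system_outputs = {
    "system_1": ["the council approved the budget monday"],
    "system_2": ["budget passes after vote by city council"],
}

# Score every system against every group's gold summaries; if the ranking
# of systems flips across groups, evaluation is sensitive to who annotated.
for group, golds in gold_by_group.items():
    scores = {
        name: sum(unigram_f1(o, g) for o, g in zip(outs, golds)) / len(golds)
        for name, outs in system_outputs.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(group, {k: round(v, 3) for k, v in scores.items()}, "ranking:", ranking)
```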
Related papers
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of conversational turns has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z)
- Evaluating Agents using Social Choice Theory [21.26784305333596]
We argue that many general evaluation problems can be viewed through the lens of voting theory.
Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation.
These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation.
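As an illustration of this voting-theoretic view, the hypothetical sketch below aggregates per-task ordinal rankings of agents with a Borda count; the tasks, agents, and the specific Borda rule are assumptions rather than the paper's exact aggregation scheme.

```python
# Hypothetical sketch: treat each task as a voter that ranks agents, then
# aggregate the ordinal rankings with a Borda count. The tasks, agents,
# and the Borda rule itself are illustrative assumptions.
from collections import defaultdict

# Each task "votes" with a ranking of agents, best first.
task_rankings = {
    "task_qa":        ["agent_b", "agent_a", "agent_c"],
    "task_summarize": ["agent_a", "agent_b", "agent_c"],
    "task_dialogue":  ["agent_b", "agent_c", "agent_a"],
}

def borda_aggregate(rankings: dict) -> list:
    """Return agents ordered by total Borda points across all tasks."""
    points = defaultdict(int)
    for ranking in rankings.values():
        n = len(ranking)
        for position, agent in enumerate(ranking):
            points[agent] += n - 1 - position  # best gets n-1 points, worst 0
    return sorted(points, key=points.get, reverse=True)

print(borda_aggregate(task_rankings))  # ['agent_b', 'agent_a', 'agent_c']
```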
arXiv Detail & Related papers (2023-12-05T20:40:37Z)
- The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation [16.591822946975547]
We argue that the generation of more esoteric forms of language constitutes a subdomain where the characteristics of selected evaluator panels are of utmost importance.
We perform a critical survey of recent works in NLG to assess how well evaluation procedures are reported in this subdomain.
We note a severe lack of open reporting of evaluator demographic information, and a significant reliance on crowdsourcing platforms for recruitment.
arXiv Detail & Related papers (2023-11-09T17:50:23Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
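Here "outperform" means correlating more strongly with the human judgments in the dataset. A minimal meta-evaluation sketch with invented numbers (only the correlation logic is the point):

```python
# Minimal meta-evaluation sketch with invented numbers: a metric is "better"
# if its scores correlate more strongly with human judgments across summaries.
from scipy.stats import kendalltau

human_scores  = [4.0, 2.5, 3.5, 1.0, 5.0]       # hypothetical human ratings
neural_metric = [0.81, 0.55, 0.70, 0.30, 0.90]  # hypothetical neural metric
ngram_metric  = [0.40, 0.45, 0.38, 0.35, 0.50]  # hypothetical n-gram metric

for name, scores in [("neural", neural_metric), ("n-gram", ngram_metric)]:
    tau, _ = kendalltau(human_scores, scores)
    print(f"{name} metric: Kendall tau vs. humans = {tau:.2f}")
```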
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
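Under the ACU protocol, the reference summary is decomposed into atomic facts and annotators mark which of them a system summary covers. The sketch below shows the resulting summary-level score in its simplest, unnormalized form; the example ACUs and judgments are made up.

```python
# Illustrative sketch of ACU-style scoring: annotators mark which atomic
# content units from the reference a system summary covers; the summary
# score is the fraction of covered units. The ACUs and judgments below
# are made up, and the full protocol additionally normalizes for summary length.
acus = [
    "the council approved the budget",
    "the vote took place on monday",
    "two members voted against it",
]
# Hypothetical annotator judgments: is each ACU present in the system summary?
covered = {acus[0]: True, acus[1]: True, acus[2]: False}

acu_score = sum(covered.values()) / len(acus)
print(f"ACU score: {acu_score:.2f}")  # 0.67
```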
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
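For reference, the WEAT statistic compares how strongly two target word sets associate with two attribute sets in embedding space. The toy sketch below uses random placeholder vectors purely to show the computation; the word sets and embeddings are not the paper's data.

```python
# Toy WEAT effect-size computation on placeholder embeddings; the word sets
# and random vectors stand in for real target/attribute terms and a real
# embedding model.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       ["engineer", "scientist", "nurse", "teacher", "he", "him", "she", "her"]}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    """s(w, A, B): mean cosine similarity to A minus mean similarity to B."""
    return (np.mean([cos(emb[w], emb[a]) for a in A])
            - np.mean([cos(emb[w], emb[b]) for b in B]))

X, Y = ["engineer", "scientist"], ["nurse", "teacher"]  # target sets (placeholders)
A, B = ["he", "him"], ["she", "her"]                    # attribute sets (placeholders)

effect = ((np.mean([assoc(x, A, B) for x in X])
           - np.mean([assoc(y, A, B) for y in Y]))
          / np.std([assoc(w, A, B) for w in X + Y]))
print(f"WEAT effect size: {effect:.2f}")
```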
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- Predicting user demographics based on interest analysis [1.7403133838762448]
This paper proposes a framework to predict users' demographics based on the ratings they register in a system.
Using all ratings registered by users improves the prediction accuracy by at least 16% compared with previously studied models.
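A hedged sketch of the general idea, not the paper's framework: treat each user's rating vector as features and fit an off-the-shelf classifier to predict a demographic label. The ratings, labels, and the choice of logistic regression are assumptions.

```python
# Illustrative sketch only: predict a demographic label from a user-item
# rating matrix with an off-the-shelf classifier. The ratings, labels, and
# logistic regression are stand-ins, not the paper's framework.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = users, columns = items, values = ratings (0 = unrated).
ratings = np.array([
    [5, 0, 3, 1, 0],
    [4, 0, 4, 2, 1],
    [1, 5, 0, 4, 5],
    [0, 4, 1, 5, 4],
    [5, 1, 4, 0, 0],
    [0, 5, 0, 4, 5],
])
# Hypothetical binary demographic attribute for each user (e.g. age group).
labels = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression(max_iter=1000).fit(ratings, labels)
new_user = np.array([[4, 1, 5, 0, 1]])
print("predicted group:", clf.predict(new_user)[0])
```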
arXiv Detail & Related papers (2021-08-02T16:25:09Z)
- Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers [0.685316573653194]
We survey human evaluation in papers presenting work on creative natural language generation.
The most common human evaluation method is a scaled survey, typically on a 5-point scale.
The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value.
arXiv Detail & Related papers (2021-07-31T18:54:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.