Social Biases in Automatic Evaluation Metrics for NLG
- URL: http://arxiv.org/abs/2210.08859v1
- Date: Mon, 17 Oct 2022 08:55:26 GMT
- Title: Social Biases in Automatic Evaluation Metrics for NLG
- Authors: Mingqi Gao, Xiaojun Wan
- Abstract summary: We propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image captioning and text summarization tasks.
- Score: 53.76118154594404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many studies have revealed that word embeddings, language models, and models
for specific downstream tasks in NLP are prone to social biases, especially
gender bias. Recently these techniques have been gradually applied to automatic
evaluation metrics for text generation. In this paper, we propose an evaluation
method based on the Word Embeddings Association Test (WEAT) and the Sentence
Embeddings Association Test (SEAT) to quantify social biases in evaluation
metrics and
discover that social biases are also widely present in some model-based
automatic evaluation metrics. Moreover, we construct gender-swapped
meta-evaluation datasets to explore the potential impact of gender bias in
image captioning and text summarization tasks. Results show that, given
gender-neutral references in the evaluation, model-based evaluation metrics may
show a preference for the male hypothesis, and their performance, i.e. the
correlation between evaluation metrics and human judgments, usually varies more
significantly after gender swapping.
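The core of the proposed quantification is the WEAT/SEAT association test. Below is a minimal sketch of the standard WEAT effect size (Caliskan et al., 2017) with random vectors standing in for embeddings; how the paper plugs an evaluation metric's representations into WEAT/SEAT is an assumption here, not the paper's exact implementation.
```python
# Minimal WEAT effect-size sketch; placeholder vectors, not the paper's setup.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity to attribute set A minus to set B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # d = (mean_x s(x,A,B) - mean_y s(y,A,B)) / std_{w in X∪Y} s(w,A,B)
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(np.array(s_X + s_Y), ddof=1)

# Toy usage: random vectors stand in for embeddings of target words
# (e.g., career vs. family terms) and attribute words (e.g., male vs. female terms).
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(8, 50)), rng.normal(size=(8, 50))
A, B = rng.normal(size=(8, 50)), rng.normal(size=(8, 50))
print(weat_effect_size(X, Y, A, B))
```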
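The reported change in metric performance under gender swapping amounts to comparing the metric–human correlation on the original and gender-swapped meta-evaluation sets. The following is a hedged sketch of that comparison using Kendall's tau; the human judgments and metric scores are invented placeholders, not values from the paper.
```python
# Hypothetical sketch: compare a metric's correlation with human judgments
# on an original meta-evaluation set vs. its gender-swapped counterpart.
# All scores below are invented placeholders.
from scipy.stats import kendalltau

human_scores           = [4.0, 3.5, 2.0, 4.5, 1.5, 3.0]
metric_scores_original = [0.82, 0.75, 0.40, 0.88, 0.35, 0.60]
metric_scores_swapped  = [0.70, 0.78, 0.52, 0.80, 0.30, 0.66]

tau_orig, _ = kendalltau(human_scores, metric_scores_original)
tau_swap, _ = kendalltau(human_scores, metric_scores_swapped)
print(f"Kendall tau (original):       {tau_orig:.2f}")
print(f"Kendall tau (gender-swapped): {tau_swap:.2f}")
print(f"Absolute change after swapping: {abs(tau_orig - tau_swap):.2f}")
```
A larger absolute change after swapping would indicate that the metric's agreement with human judgments is sensitive to the gender of the entities mentioned in the hypotheses.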
Related papers
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z) - GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models [73.23743278545321]
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases.
GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z) - LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z) - Gender Bias in Transformer Models: A comprehensive survey [1.1011268090482573]
Gender bias in artificial intelligence (AI) has emerged as a pressing concern with profound implications for individuals' lives.
This paper presents a comprehensive survey that explores gender bias in Transformer models from a linguistic perspective.
arXiv Detail & Related papers (2023-06-18T11:40:47Z) - Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z) - Comparing Intrinsic Gender Bias Evaluation Measures without using Human Annotated Examples [33.044775876807826]
We propose a method to compare intrinsic gender bias evaluation measures without relying on human-annotated examples.
Specifically, we create bias-controlled versions of language models using varying amounts of male vs. female gendered sentences.
We then compute the rank correlation between the resulting bias scores and the gender proportions used to fine-tune the PLMs (a minimal sketch of this step appears after this list).
arXiv Detail & Related papers (2023-01-28T03:11:50Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Choose Your Lenses: Flaws in Gender Bias Evaluation [29.16221451643288]
We assess the current paradigm of gender bias evaluation and identify several flaws in it.
First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is affected by gender.
Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions.
arXiv Detail & Related papers (2022-10-20T17:59:55Z) - Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
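As referenced in the entry above on comparing intrinsic gender bias evaluation measures, the rank-correlation step could look like the sketch below; the gender proportions and bias scores are invented for illustration and are not taken from that paper.
```python
# Hypothetical sketch: rank-correlate bias scores from some intrinsic bias
# measure with the male/female proportions used to build bias-controlled
# models. All numbers below are invented for illustration.
from scipy.stats import spearmanr

# Proportion of male-gendered sentences used when fine-tuning each model.
male_proportions = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# Bias score that some intrinsic measure assigns to each resulting model.
bias_scores = [-0.9, -0.4, -0.1, 0.2, 0.5, 0.8]

rho, p_value = spearmanr(male_proportions, bias_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```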
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.