Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation
- URL: http://arxiv.org/abs/2402.12649v1
- Date: Tue, 20 Feb 2024 01:49:15 GMT
- Title: Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation
- Authors: Kristian Lum, Jacy Reese Anthis, Chirag Nagpal, Alexander D'Amour
- Abstract summary: We study the correspondence between decontextualized "trick tests" and evaluations that are more grounded in Realistic Use and Tangible Effects.
We compare three de-contextualized evaluations adapted from the current literature to three analogous RUTEd evaluations applied to long-form content generation.
We found no correspondence between trick tests and RUTEd evaluations.
- Score: 55.66090768926881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bias benchmarks are a popular method for studying the negative impacts of
bias in LLMs, yet there has been little empirical investigation of whether
these benchmarks are actually indicative of how harm may manifest in
real-world use. In this work, we study the correspondence between such
decontextualized "trick tests" and evaluations that are more grounded in
Realistic Use and Tangible Effects (i.e., RUTEd evaluations). We explore this
correlation in the context of gender-occupation bias--a popular genre of bias
evaluation. We compare three de-contextualized evaluations adapted from the
current literature to three analogous RUTEd evaluations applied to long-form
content generation. We conduct each evaluation for seven instruction-tuned
LLMs. For the RUTEd evaluations, we conduct repeated trials of three text
generation tasks: children's bedtime stories, user personas, and English
language learning exercises. We found no correspondence between trick tests and
RUTEd evaluations. Specifically, selecting the least biased model based on the
de-contextualized results coincides with selecting the model with the best
performance on RUTEd evaluations only as often as random chance. We conclude
that evaluations that are not based in realistic use are likely insufficient to
mitigate and assess bias and real-world harms.
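To make the chance-level finding concrete, the sketch below is a hypothetical illustration (not the authors' code; the model scores are randomly generated stand-ins) of why, with seven models and no genuine correspondence between the two evaluation styles, the "least biased" model under trick tests matches the best model under RUTEd evaluations only about one time in seven.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 7, 10_000

# Hypothetical bias scores (lower = less biased) for 7 models under a
# decontextualized "trick test" and a RUTEd evaluation. With no genuine
# correspondence between the two, the same model is selected by both
# criteria only about 1/7 of the time, i.e., at chance level.
agree = 0
for _ in range(n_trials):
    trick_scores = rng.random(n_models)  # stand-in for trick-test results
    ruted_scores = rng.random(n_models)  # stand-in for RUTEd results
    agree += trick_scores.argmin() == ruted_scores.argmin()

print(f"agreement rate: {agree / n_trials:.3f}  (chance = {1 / n_models:.3f})")
```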
Related papers
- CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models [58.57987316300529]
Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks.
To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets.
We propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks.
arXiv Detail & Related papers (2024-07-02T16:31:37Z)
- Towards Real World Debiasing: A Fine-grained Analysis On Spurious Correlation [17.080528126651977]
We revisit biased distributions in existing benchmarks and real-world datasets, and propose a fine-grained framework for analyzing dataset bias.
Results show that existing methods are incapable of handling real-world biases.
We propose a simple yet effective approach that can be easily applied to existing debias methods, named Debias in Destruction (DiD)
arXiv Detail & Related papers (2024-05-24T06:06:41Z)
- Likelihood-based Mitigation of Evaluation Bias in Large Language Models [37.07596663793111]
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics.
A likelihood bias may arise if LLMs are used for evaluation.
arXiv Detail & Related papers (2024-02-25T04:52:02Z)
- GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs)
Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
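For context on the WEAT-based metric mentioned above, here is a minimal sketch of the standard WEAT effect size computed from word vectors; the word lists and embeddings are placeholder assumptions, and this is not that paper's implementation.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size for target sets X, Y and attribute sets A, B (lists of vectors)."""
    def s(w):
        # Differential association of word vector w with attributes A vs. B.
        return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

    s_X = [s(x) for x in X]
    s_Y = [s(y) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy usage: random vectors stand in for real word embeddings.
rng = np.random.default_rng(0)
vec = lambda: rng.normal(size=50)
X = [vec() for _ in range(8)]  # e.g., occupation terms (placeholder)
Y = [vec() for _ in range(8)]  # e.g., home/family terms (placeholder)
A = [vec() for _ in range(8)]  # e.g., male attribute terms (placeholder)
B = [vec() for _ in range(8)]  # e.g., female attribute terms (placeholder)
print(weat_effect_size(X, Y, A, B))
```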
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- Measuring Fairness with Biased Rulers: A Survey on Quantifying Biases in Pretrained Language Models [2.567384209291337]
An increasing awareness of biased patterns in natural language processing resources has motivated many metrics to quantify 'bias' and 'fairness'.
We survey the existing literature on fairness metrics for pretrained language models and experimentally evaluate their compatibility.
We find that many metrics are not compatible and highly depend on (i) templates, (ii) attribute and target seeds and (iii) the choice of embeddings.
arXiv Detail & Related papers (2021-12-14T15:04:56Z)