Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation
- URL: http://arxiv.org/abs/2402.12649v1
- Date: Tue, 20 Feb 2024 01:49:15 GMT
- Title: Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation
- Authors: Kristian Lum, Jacy Reese Anthis, Chirag Nagpal, Alexander D'Amour
- Abstract summary: We study the correspondence between decontextualized "trick tests" and evaluations that are more grounded in Realistic Use and Tangible Effects.
We compare three de-contextualized evaluations adapted from the current literature to three analogous RUTEd evaluations applied to long-form content generation.
We found no correspondence between trick tests and RUTEd evaluations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bias benchmarks are a popular method for studying the negative impacts of
bias in LLMs, yet there has been little empirical investigation of whether
these benchmarks are actually indicative of how harm may manifest in the real
world. In this work, we study the correspondence between such
decontextualized "trick tests" and evaluations that are more grounded in
Realistic Use and Tangible Effects (i.e., RUTEd evaluations). We explore this
correlation in the context of gender-occupation bias--a popular genre of bias
evaluation. We compare three de-contextualized evaluations adapted from the
current literature to three analogous RUTEd evaluations applied to long-form
content generation. We conduct each evaluation for seven instruction-tuned
LLMs. For the RUTEd evaluations, we conduct repeated trials of three text
generation tasks: children's bedtime stories, user personas, and English
language learning exercises. We found no correspondence between trick tests and
RUTEd evaluations. Specifically, selecting the least biased model based on the
de-contextualized results coincides with selecting the model with the best
performance on RUTEd evaluations only as often as random chance. We conclude
that evaluations that are not based in realistic use are likely insufficient to
mitigate and assess bias and real-world harms.
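To illustrate the distinction the abstract draws, a RUTEd-style evaluation might score long-form output directly, for example by counting gendered pronouns in text generated for each occupation prompt. The sketch below is a hypothetical illustration of that idea, not the paper's actual metric; the function names and the skew definition are assumptions made here.

```python
import re
from collections import Counter

# Hypothetical illustration: score generated long-form text (e.g., a bedtime
# story about an occupation) by its balance of gendered pronouns.
# This is NOT the paper's exact metric.

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def pronoun_counts(text: str) -> dict:
    """Count masculine vs. feminine pronouns in a generated passage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "masculine": sum(counts[t] for t in MASCULINE),
        "feminine": sum(counts[t] for t in FEMININE),
    }

def gender_skew(text: str) -> float:
    """Signed skew in [-1, 1]: +1 all masculine, -1 all feminine, 0 balanced."""
    c = pronoun_counts(text)
    total = c["masculine"] + c["feminine"]
    return 0.0 if total == 0 else (c["masculine"] - c["feminine"]) / total

story = "The engineer checked her code, and she smiled when it passed."
print(pronoun_counts(story))  # {'masculine': 0, 'feminine': 2}
print(gender_skew(story))     # -1.0
```

A de-contextualized "trick test" would instead probe a single templated completion; the paper's finding is that model rankings from the two styles of evaluation do not agree.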
Related papers
- Rethinking Prompt-based Debiasing in Large Language Models [40.90578215191079]
Investigating bias in large language models (LLMs) is crucial for developing trustworthy AI.
While prompt-based debiasing through prompt engineering is common, its effectiveness relies on the assumption that models inherently understand biases.
Our study systematically analyzed this assumption using the BBQ and StereoSet benchmarks on both open-source models and commercial GPT models.
arXiv Detail & Related papers (2025-03-12T10:06:03Z) - Beneath the Surface: How Large Language Models Reflect Hidden Bias [7.026605828163043]
We introduce the Hidden Bias Benchmark (HBB), a novel dataset designed to assess hidden bias, in which bias concepts are embedded within naturalistic, subtly framed contexts drawn from real-world scenarios.
We analyze six state-of-the-art Large Language Models, revealing that while models reduce bias when it is overt, they continue to reinforce biases in nuanced settings.
arXiv Detail & Related papers (2025-02-27T04:25:54Z) - Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions.
This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context.
We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement.
To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z) - CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models [58.57987316300529]
Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks.
To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets.
We propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks.
arXiv Detail & Related papers (2024-07-02T16:31:37Z) - Towards Real World Debiasing: A Fine-grained Analysis On Spurious Correlation [17.080528126651977]
We revisit biased distributions in existing benchmarks and real-world datasets, and propose a fine-grained framework for analyzing dataset bias.
Results show that existing methods are incapable of handling real-world biases.
We propose a simple yet effective approach that can be easily applied to existing debiasing methods, named Debias in Destruction (DiD).
arXiv Detail & Related papers (2024-05-24T06:06:41Z) - Likelihood-based Mitigation of Evaluation Bias in Large Language Models [37.07596663793111]
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics.
If LLMs are used for evaluation, a likelihood bias may arise.
arXiv Detail & Related papers (2024-02-25T04:52:02Z) - COBIAS: Contextual Reliability in Bias Assessment [14.594920595573038]
Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices.
Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets.
We introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear.
arXiv Detail & Related papers (2024-02-22T10:46:11Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, achieving remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs).
Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z) - An Offline Metric for the Debiasedness of Click Models [52.25681483524383]
Click models are a common method for extracting information from user clicks.
Recent work shows that the current evaluation practices in the community fail to guarantee that a well-performing click model generalizes well to downstream tasks.
We introduce the concept of debiasedness in click modeling and derive a metric for measuring it.
arXiv Detail & Related papers (2023-04-19T10:59:34Z) - Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z) - The SAME score: Improved cosine based bias score for word embeddings [49.75878234192369]
We introduce SAME, a novel bias score for semantic bias in embeddings.
We show that SAME is capable of measuring semantic bias and identify potential causes for social bias in downstream tasks.
arXiv Detail & Related papers (2022-03-28T09:28:13Z) - Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind and has seen no significant update in the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z) - Measuring Fairness with Biased Rulers: A Survey on Quantifying Biases in Pretrained Language Models [2.567384209291337]
An increasing awareness of biased patterns in natural language processing resources has motivated many metrics to quantify 'bias' and 'fairness'.
We survey the existing literature on fairness metrics for pretrained language models and experimentally evaluate compatibility.
We find that many metrics are not compatible and highly depend on (i) templates, (ii) attribute and target seeds and (iii) the choice of embeddings.
arXiv Detail & Related papers (2021-12-14T15:04:56Z) - LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
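Several of the metrics surveyed above (e.g., the WEAT/SEAT-based approach and cosine-based scores like SAME) build on embedding association tests. Below is a minimal sketch of the classic WEAT effect size (Caliskan et al., 2017); toy random vectors stand in for real word embeddings, so the set names and numbers here are illustrative assumptions only.

```python
import numpy as np

# Minimal WEAT sketch: target word sets X, Y and attribute sets A, B.
# Effect size is a Cohen's-d-style statistic over per-word associations.

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    """s(w, A, B): mean cosine similarity to A minus mean to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """(mean s over X - mean s over Y) / pooled sample std of s."""
    s_x = [assoc(x, A, B) for x in X]
    s_y = [assoc(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)

rng = np.random.default_rng(0)
dim = 8
A = [rng.normal(size=dim) for _ in range(4)]  # attribute set (e.g., one gender's terms)
B = [rng.normal(size=dim) for _ in range(4)]  # contrasting attribute set
X = [A[0] + 0.1 * rng.normal(size=dim) for _ in range(4)]  # targets built near A
Y = [B[0] + 0.1 * rng.normal(size=dim) for _ in range(4)]  # targets built near B
print(weat_effect_size(X, Y, A, B))  # positive: X associates with A, Y with B
```

Template- and seed-dependence of exactly such scores is what the "Biased Rulers" survey above reports as a compatibility problem across metrics.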
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.