Related papers: CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias

URL: http://arxiv.org/abs/2308.12539v2
Date: Wed, 24 Jan 2024 01:09:01 GMT
Title: CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias
Authors: Vipul Gupta, Pranav Narayanan Venkit, Hugo Laurençon, Shomir Wilson, Rebecca J. Passonneau
Abstract summary: We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of two types of universally relevant sociodemographic bias, gender and race. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates.
Score: 8.031232439623627
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language models (LMs) become increasingly powerful and widely used, it is important to quantify them for sociodemographic bias with potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or limited number of templates. Also, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of two types of universally relevant sociodemographic bias, gender and race. CALM integrates sixteen datasets for question-answering, sentiment analysis and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., length, vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 large language models, and find that for 2 language model series, larger parameter models tend to be more biased than smaller ones. The T0 series is the least biased model family of the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.
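
The prompt count in the abstract follows from crossing every template with every name: 224 templates x 7 groups x 50 names = 78,400. The sketch below only illustrates that arithmetic. The `{name}` placeholder, the `build_prompts` helper, and the synthetic templates and names are assumptions made for illustration; they are not taken from the CALM code base.

```python
from itertools import product


def build_prompts(templates, names_by_group):
    """Fill every template with every name, keeping the group label.

    Hypothetical helper: assumes each template holds a single
    '{name}' placeholder, which may differ from CALM's real format.
    """
    prompts = []
    for template, (group, names) in product(templates, names_by_group.items()):
        for name in names:
            prompts.append((group, template.format(name=name)))
    return prompts


# Synthetic stand-ins matching the counts stated in the abstract.
templates = [f"Template {i}: {{name}} went to the store." for i in range(224)]
names_by_group = {
    f"group_{g}": [f"Name{g}_{n}" for n in range(50)] for g in range(7)
}

prompts = build_prompts(templates, names_by_group)
print(len(prompts))  # 224 * 7 * 50 = 78400
```

Keeping the group label next to each prompt is one simple way to let per-group scores be aggregated later; the actual benchmark may organise its data differently.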

Related papers

When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification [14.187153195380668]
Large language models have remarkable capabilities across many NLP tasks, but their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We evaluate five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Surprisingly, we find that XLM-R substantially outperforms all tested LLMs, achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%.
arXiv Detail & Related papers (2025-07-28T10:49:04Z)
Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models [49.41113560646115]
We investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle.
arXiv Detail & Related papers (2025-06-12T08:47:40Z)
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
With a Grain of SALT: Are LLMs Fair Across Social Dimensions? [3.979019316355144]
This paper presents an analysis of biases in open-source Large Language Models (LLMs) across various genders, religions, and races. We introduce a methodology for generating a bias detection dataset using seven bias triggers: General Debate, Positioned Debate, Career Advice, Story Generation, Problem-Solving, Cover-Letter Writing, and CV Generation. We anonymise the LLM-generated text associated with each group using GPT-4o-mini and do a pairwise comparison using GPT-4o-as-a-Judge.
arXiv Detail & Related papers (2024-10-16T12:22:47Z)
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
arXiv Detail & Related papers (2024-09-20T18:34:38Z)
Do Multilingual Large Language Models Mitigate Stereotype Bias? [9.31741279000585]
This study systematically trains six LLMs of identical size and architecture in English, German, French, Italian, and Spanish. We observe that multilingual models achieve not only lower bias but also superior prediction accuracy when compared to monolingual models.
arXiv Detail & Related papers (2024-07-08T08:46:50Z)
Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors [19.601600598570215]
We show that large language models (LLMs) can achieve high performance on large multi-class classification tasks but still make classification errors and, worse, generate out-of-vocabulary class labels. We introduce the Paraphrase and AGgregate (PAG)-LLM approach, wherein an LLM generates multiple paraphrases of the input query (parallel queries). We show that PAG-LLM is especially effective for hard examples where the LLM is uncertain, and reduces the critical misclassification and hallucinated label generation errors.
arXiv Detail & Related papers (2024-06-24T22:30:26Z)
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
VLBiasBench is a benchmark aimed at evaluating biases in Large Vision-Language Models (LVLMs). We construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status). We conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealed by these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z)
The Impact of Unstated Norms in Bias Analysis of Language Models [0.03495246564946556]
Counterfactual bias evaluation is a widely used approach to quantifying bias. We find that template-based probes can lead to unrealistic bias measurements.
arXiv Detail & Related papers (2024-04-04T14:24:06Z)
Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts). This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
Taxonomy-based CheckList for Large Language Model Evaluation [0.0]
We introduce human knowledge into natural language interventions and study pre-trained language models' (LMs) behaviors. Inspired by CheckList behavioral testing, we present a checklist-style task that aims to probe and quantify LMs' unethical behaviors.
arXiv Detail & Related papers (2023-12-15T12:58:07Z)
CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models. The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control. Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z)
Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs). We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model. We propose LOGAN, a new bias detection technique based on clustering. Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.