BBQ: A Hand-Built Bias Benchmark for Question Answering
- URL: http://arxiv.org/abs/2110.08193v1
- Date: Fri, 15 Oct 2021 16:43:46 GMT
- Title: BBQ: A Hand-Built Bias Benchmark for Question Answering
- Authors: Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar,
Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman
- Abstract summary: It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA)
We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight textitattested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts.
We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the model's outputs consistently reproduce harmful biases in this setting
- Score: 25.108222728383236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is well documented that NLP models learn social biases present in the
world, but little work has been done to show how these biases manifest in
actual model outputs for applied tasks like question answering (QA). We
introduce the Bias Benchmark for QA (BBQ), a dataset consisting of
question-sets constructed by the authors that highlight \textit{attested}
social biases against people belonging to protected classes along nine
different social dimensions relevant for U.S. English-speaking contexts. Our
task evaluates model responses at two distinct levels: (i) given an
under-informative context, test how strongly model answers reflect social
biases, and (ii) given an adequately informative context, test whether the
model's biases still override a correct answer choice. We find that models
strongly rely on stereotypes when the context is ambiguous, meaning that the
model's outputs consistently reproduce harmful biases in this setting. Though
models are much more accurate when the context provides an unambiguous answer,
they still rely on stereotyped information and achieve an accuracy 2.5
percentage points higher on examples where the correct answer aligns with a
social bias, with this accuracy difference widening to 5 points for examples
targeting gender.
Related papers
- VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
VLBiasBench is a benchmark aimed at evaluating biases in Large Vision-Language Models (LVLMs)
We construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status)
We conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealing from these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z) - COBIAS: Contextual Reliability in Bias Assessment [14.594920595573038]
Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices.
Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets.
We introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear.
arXiv Detail & Related papers (2024-02-22T10:46:11Z) - SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in
Generative Language Models [8.211129045180636]
We introduce a benchmark meant to capture the amplification of social bias, via stigmas, in generative language models.
Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to test for both social bias and model robustness.
We find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles.
arXiv Detail & Related papers (2023-12-12T18:27:44Z) - Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z) - Realistic Conversational Question Answering with Answer Selection based
on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim at answering a question with its relevant paragraph and previous question-answer pairs that occurred during conversation multiple times.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z) - The Tail Wagging the Dog: Dataset Construction Biases of Social Bias
Benchmarks [75.58692290694452]
We compare social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye.
We observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models.
arXiv Detail & Related papers (2022-10-18T17:58:39Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs)
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z) - Roses Are Red, Violets Are Blue... but Should Vqa Expect Them To? [0.0]
We argue that the standard evaluation metric, which consists in measuring the overall in-domain accuracy, is misleading.
We propose the GQA-OOD benchmark designed to overcome these concerns.
arXiv Detail & Related papers (2020-06-09T08:50:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.