KoBBQ: Korean Bias Benchmark for Question Answering
- URL: http://arxiv.org/abs/2307.16778v2
- Date: Thu, 25 Jan 2024 12:48:10 GMT
- Title: KoBBQ: Korean Bias Benchmark for Question Answering
- Authors: Jiho Jin, Jiseon Kim, Nayeon Lee, Haneul Yoo, Alice Oh, Hwaran Lee
- Abstract summary: The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs).
We present KoBBQ, a Korean bias benchmark dataset.
We propose a general framework that addresses considerations for cultural adaptation of a dataset.
- Score: 28.091808407408823
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The Bias Benchmark for Question Answering (BBQ) is designed to evaluate
social biases of language models (LMs), but it is not simple to adapt this
benchmark to cultural contexts other than the US because social biases depend
heavily on the cultural context. In this paper, we present KoBBQ, a Korean bias
benchmark dataset, and we propose a general framework that addresses
considerations for cultural adaptation of a dataset. Our framework includes
partitioning the BBQ dataset into three classes: Simply-Transferred (can be
used directly after cultural translation), Target-Modified (requires
localization of target groups), and Sample-Removed (does not fit Korean
culture), and adding four new categories of bias specific to Korean culture.
We conduct a large-scale survey to collect and validate the social biases and
the targets of the biases that reflect the stereotypes in Korean culture. The
resulting KoBBQ dataset comprises 268 templates and 76,048 samples across 12
categories of social bias. We use KoBBQ to measure the accuracy and bias scores
of several state-of-the-art multilingual LMs. The results clearly show
differences in the bias of LMs as measured by KoBBQ and a machine-translated
version of BBQ, demonstrating the need for and utility of a well-constructed,
culturally-aware social bias benchmark.
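The exact scoring formulas (KoBBQ reports accuracy and a diff-bias score) are defined in the paper itself; the following is only a rough illustrative sketch of BBQ-style scoring on ambiguous contexts, where the correct answer is always "unknown" and any other choice either aligns with or runs counter to the stereotype. The record format and field names here are hypothetical, not KoBBQ's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AmbiguousResult:
    # Hypothetical per-sample record; the real KoBBQ data format differs.
    # prediction is one of "biased", "counter_biased", or "unknown".
    prediction: str

def ambiguous_scores(results):
    """Accuracy and a BBQ-style bias score over ambiguous contexts.

    In an ambiguous context "unknown" is the correct answer, so
    accuracy is the fraction of "unknown" predictions, and the bias
    score is the signed tendency of the remaining (non-unknown)
    answers toward the stereotyped target.
    """
    n = len(results)
    n_biased = sum(r.prediction == "biased" for r in results)
    n_counter = sum(r.prediction == "counter_biased" for r in results)
    n_unknown = n - n_biased - n_counter
    accuracy = n_unknown / n
    answered = n_biased + n_counter
    # +1.0 means every answered sample followed the stereotype,
    # -1.0 means every answered sample went against it.
    bias = 0.0 if answered == 0 else (n_biased - n_counter) / answered
    return accuracy, bias
```

For example, with predictions ["biased", "biased", "counter_biased", "unknown"], this sketch yields an accuracy of 0.25 and a bias score of 1/3.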
Related papers
- VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
VLBiasBench is a benchmark aimed at evaluating biases in Large Vision-Language Models (LVLMs).
We construct a dataset encompassing nine distinct categories of social bias, including age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status, plus two intersectional bias categories (race x gender and race x socioeconomic status).
We conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing new insights into the biases revealed by these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z) - MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs [6.781972039785424]
Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes.
We present MBBQ, a dataset that measures stereotypes commonly held across Dutch, Spanish, and Turkish languages.
Our results confirm that some non-English languages exhibit more bias than English, even when controlling for cultural differences.
arXiv Detail & Related papers (2024-06-11T13:23:14Z) - Analyzing Social Biases in Japanese Large Language Models [24.351580958043595]
We construct the Japanese Bias Benchmark dataset for Question Answering (JBBQ) based on the English bias benchmark BBQ.
We analyze social biases in Japanese Large Language Models (LLMs).
Prompts with warnings about social biases and Chain-of-Thought prompting reduce the effect of biases in model outputs.
arXiv Detail & Related papers (2024-06-04T07:31:06Z) - CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean [18.526285276022907]
We introduce CLIcK, a benchmark dataset of Cultural and Linguistic Intelligence in Korean, comprising 1,995 QA pairs.
CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture.
Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performance across the categories, as well as the diverse factors affecting their comprehension.
arXiv Detail & Related papers (2024-03-11T03:54:33Z) - Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z) - CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z) - The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks [75.58692290694452]
We compare social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye.
We observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models.
arXiv Detail & Related papers (2022-10-18T17:58:39Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs) to detect social biases.
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - BBQ: A Hand-Built Bias Benchmark for Question Answering [25.108222728383236]
It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA).
We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts.
We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the models' outputs consistently reproduce harmful biases in this setting.
arXiv Detail & Related papers (2021-10-15T16:43:46Z) - UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.