PakBBQ: A Culturally Adapted Bias Benchmark for QA
- URL: http://arxiv.org/abs/2508.10186v2
- Date: Sun, 28 Sep 2025 20:05:18 GMT
- Title: PakBBQ: A Culturally Adapted Bias Benchmark for QA
- Authors: Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza
- Abstract summary: We introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering dataset. PakBBQ comprises over 214 templates and 17,180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions relevant in Pakistan: age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality.
- Score: 3.4455728937232597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates and 17,180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions relevant in Pakistan: age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behavior in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt-engineering strategies for bias mitigation in low-resource settings.
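The abstract describes a BBQ-style protocol: each question is posed under an ambiguous and a disambiguated context and in negative versus non-negative framings, and results are compared per condition (e.g., the reported 12% accuracy gain with disambiguation). As an illustration only, the following minimal Python sketch shows how such per-condition tabulation might look; the `summarize` function and the record fields (`context`, `polarity`, `prediction`, `label`, `stereotyped_answer`) are assumptions for this example, not the authors' released evaluation code.

```python
from collections import defaultdict

def summarize(records):
    """Aggregate accuracy and stereotype-consistent answer rate per
    (context, polarity) bucket. Record keys here are hypothetical."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "stereotyped": 0})
    for r in records:
        bucket = stats[(r["context"], r["polarity"])]
        bucket["n"] += 1
        bucket["correct"] += int(r["prediction"] == r["label"])
        bucket["stereotyped"] += int(r["prediction"] == r["stereotyped_answer"])
    return {
        key: {
            "accuracy": b["correct"] / b["n"],
            "stereotype_rate": b["stereotyped"] / b["n"],
        }
        for key, b in stats.items()
    }

# Toy usage: two records in the ambiguous/negative bucket. Comparing the
# "ambiguous" and "disambiguated" buckets would surface an accuracy gap
# analogous to the ~12% gain reported in the abstract.
demo = [
    {"context": "ambiguous", "polarity": "negative",
     "prediction": "Unknown", "label": "Unknown", "stereotyped_answer": "Person A"},
    {"context": "ambiguous", "polarity": "negative",
     "prediction": "Person A", "label": "Unknown", "stereotyped_answer": "Person A"},
]
print(summarize(demo))
```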
Related papers
- Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study [0.0]
Large Language Models (LLMs) are increasingly deployed in multilingual contexts, yet their consistency across languages on politically sensitive topics remains understudied. This paper presents a systematic benchmark study examining how 17 LLMs respond to questions concerning the sovereignty of the Republic of China (Taiwan) when queried in Chinese versus English. We discover significant language bias -- the phenomenon where the same model produces substantively different political stances depending on the query language.
arXiv Detail & Related papers (2026-02-06T03:57:21Z) - Cross-Language Bias Examination in Large Language Models [37.21579885190632]
This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models. By translating the prompts and word list into five target languages, we compare different types of bias across languages. For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias.
arXiv Detail & Related papers (2025-12-17T23:22:03Z) - XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering [48.913480244527925]
Large Language Models (LLMs) have shown significant progress in open-domain question answering (ODQA). Most evaluations focus on English and assume locale-invariant answers across languages. We introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA.
arXiv Detail & Related papers (2025-08-22T07:00:13Z) - BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context [36.56689822791777]
Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts. We introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages.
arXiv Detail & Related papers (2025-08-09T20:24:24Z) - Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs [50.07451351559251]
We present a study across five typologically distinct languages (English, Russian, German, Hindi, and Vietnamese). We examine how position bias interacts with prompt strategies and affects output entropy.
arXiv Detail & Related papers (2025-05-22T02:23:00Z) - Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation [71.59208664920452]
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. We show that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. We release Global MMLU, an improved MMLU with evaluation coverage across 42 languages.
arXiv Detail & Related papers (2024-12-04T13:27:09Z) - CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a dataset of 51.7K culturally specific questions across 23 different languages. We evaluate the factuality, relevance, and surface-level quality of LLM-generated long-form answers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z) - VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
We introduce VLBiasBench, a benchmark to evaluate biases in Large Vision-Language Models (LVLMs). VLBiasBench features a dataset covering nine distinct categories of social bias, including age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status, as well as two intersectional bias categories: race x gender and race x socioeconomic status. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z) - MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs [6.781972039785424]
Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes.
We present MBBQ, a dataset that measures stereotypes commonly held across Dutch, Spanish, and Turkish languages.
Our results confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts.
arXiv Detail & Related papers (2024-06-11T13:23:14Z) - KoBBQ: Korean Bias Benchmark for Question Answering [28.091808407408823]
The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs).
We present KoBBQ, a Korean bias benchmark dataset.
We propose a general framework that addresses considerations for cultural adaptation of a dataset.
arXiv Detail & Related papers (2023-07-31T15:44:15Z) - CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z) - Counterfactual VQA: A Cause-Effect Look at Language Bias [117.84189187160005]
VQA models tend to rely on language bias as a shortcut and fail to sufficiently learn the multi-modal knowledge from both vision and language.
We propose a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers.
arXiv Detail & Related papers (2020-06-08T01:49:27Z)