Related papers: EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering

EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering

URL: http://arxiv.org/abs/2507.11216v1
Date: Tue, 15 Jul 2025 11:37:30 GMT
Title: EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering
Authors: Valle Ruiz-Fernández, Mario Mina, Júlia Falcão, Luis Vasquez-Reina, Anna Sallés, Aitor Gonzalez-Agirre, Olatz Perez-de-Viñaspre,
Abstract summary: This paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ)<n>Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting.<n>We report evaluation results on different Large Language Models, factoring in model family, size and variant.
Score: 1.6630304911300329
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.

Related papers

CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a dataset of 51.7K culturally specific questions across 23 different languages.<n>We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
Datasets for Multilingual Answer Sentence Selection [59.28492975191415]
We introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish) Results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models.
arXiv Detail & Related papers (2024-06-14T16:50:29Z)
MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs [6.781972039785424]
Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. We present MBBQ, a dataset that measures stereotypes commonly held across Dutch, Spanish, and Turkish languages. Our results confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts.
arXiv Detail & Related papers (2024-06-11T13:23:14Z)
JBBQ: Japanese Bias Benchmark for Analyzing Social Biases in Large Language Models [24.351580958043595]
We construct the Japanese Bias Benchmark dataset for Question Answering (JBBQ) based on the English bias benchmark BBQ.<n>We show that while current open Japanese LLMs with more parameters show improved accuracies on JBBQ, their bias scores increase.<n> prompts with a warning about social biases and chain-of-thought prompting reduce the effect of biases in model outputs.
arXiv Detail & Related papers (2024-06-04T07:31:06Z)
On The Truthfulness of 'Surprisingly Likely' Responses of Large Language Models [5.252280724532548]
We show that the surprisingly likely responses of large language models are more accurate in many cases compared to standard baselines.<n>For example, we observe up to 24 percentage points aggregate improvement on TruthfulQA.<n>We also provide further analysis of the results, including the cases when surprisingly likely responses are less or not more accurate.
arXiv Detail & Related papers (2023-11-13T19:21:25Z)
KoBBQ: Korean Bias Benchmark for Question Answering [28.091808407408823]
The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs) We present KoBBQ, a Korean bias benchmark dataset. We propose a general framework that addresses considerations for cultural adaptation of a dataset.
arXiv Detail & Related papers (2023-07-31T15:44:15Z)
Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering. We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
Generative Language Models for Paragraph-Level Question Generation [79.31199020420827]
Powerful generative models have led to recent progress in question generation (QG) It is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. We introduce QG-Bench, a benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting.
arXiv Detail & Related papers (2022-10-08T10:24:39Z)
BBQ: A Hand-Built Bias Benchmark for Question Answering [25.108222728383236]
It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA) We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight textitattested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts. We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the model's outputs consistently reproduce harmful biases in this setting
arXiv Detail & Related papers (2021-10-15T16:43:46Z)
Cross-Lingual GenQA: A Language-Agnostic Generative Question Answering Approach for Open-Domain Question Answering [76.99585451345702]
Open-Retrieval Generative Question Answering (GenQA) is proven to deliver high-quality, natural-sounding answers in English. We present the first generalization of the GenQA approach for the multilingual environment.
arXiv Detail & Related papers (2021-10-14T04:36:29Z)
Counterfactual VQA: A Cause-Effect Look at Language Bias [117.84189187160005]
VQA models tend to rely on language bias as a shortcut and fail to sufficiently learn the multi-modal knowledge from both vision and language. We propose a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers.
arXiv Detail & Related papers (2020-06-08T01:49:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.