MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs
- URL: http://arxiv.org/abs/2406.07243v3
- Date: Wed, 17 Jul 2024 08:49:22 GMT
- Title: MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs
- Authors: Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
- Abstract summary: Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes.
We present MBBQ, a dataset that measures stereotypes commonly held across English, Dutch, Spanish, and Turkish.
Our results confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts.
- Score: 6.781972039785424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. While safety fine-tuning typically takes place in English, if at all, these models are being used by speakers of many different languages. There is existing evidence that the performance of these models is inconsistent across languages and that they discriminate based on demographic factors of the user. Motivated by this, we investigate whether the social stereotypes exhibited by LLMs differ as a function of the language used to prompt them, while controlling for cultural differences and task accuracy. To this end, we present MBBQ (Multilingual Bias Benchmark for Question-answering), a carefully curated version of the English BBQ dataset extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. We further complement MBBQ with a parallel control dataset to measure task performance on the question-answering task independently of bias. Our results based on several open-source and proprietary LLMs confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts. Moreover, we observe significant cross-lingual differences in bias behaviour for all except the most accurate models. With the release of MBBQ, we hope to encourage further research on bias in multilingual settings. The dataset and code are available at https://github.com/Veranep/MBBQ.
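Since MBBQ inherits BBQ's scoring convention, a small sketch can make the evaluation concrete. Below is a minimal Python sketch of BBQ-style scoring; the `Item` record layout is hypothetical (the released dataset's schema may differ), and the two bias scores follow the convention of the original BBQ paper.
```python
# Minimal sketch of BBQ-style scoring, which MBBQ inherits. The `Item`
# record layout is hypothetical (the released dataset's schema may
# differ); the two bias scores follow the convention of the original
# BBQ paper: s_dis over disambiguated contexts, and s_amb, which is
# additionally scaled by the error rate over ambiguous contexts.

from dataclasses import dataclass

@dataclass
class Item:
    condition: str  # "ambiguous" or "disambiguated"
    correct: int    # index of the correct answer option
    biased: int     # index of the stereotype-conforming option
    unknown: int    # index of the "unknown" option
    chosen: int     # option the model actually picked

def bias_scores(items: list[Item]) -> tuple[float, float]:
    def directional(subset):
        # Among non-"unknown" answers, the fraction conforming to the
        # stereotype, rescaled to [-1, 1] so 0 means no preference.
        answered = [it for it in subset if it.chosen != it.unknown]
        if not answered:
            return 0.0
        return 2 * sum(it.chosen == it.biased for it in answered) / len(answered) - 1

    dis = [it for it in items if it.condition == "disambiguated"]
    amb = [it for it in items if it.condition == "ambiguous"]
    # In ambiguous contexts "unknown" is always correct, so the bias
    # score is weighted by the error rate: a model that always answers
    # correctly scores 0 regardless of which wrong answers it favours.
    amb_accuracy = sum(it.chosen == it.correct for it in amb) / max(len(amb), 1)
    return (1 - amb_accuracy) * directional(amb), directional(dis)
```
Computed per language, scores like these, together with the parallel control dataset for estimating task accuracy, are what allow cross-lingual bias differences to be separated from raw question-answering ability.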
Related papers
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do.
We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models.
Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z)
- Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
arXiv Detail & Related papers (2024-01-15T16:39:10Z)
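The question-alignment recipe described above lends itself to a simple data-preparation sketch. Everything here is illustrative: the prompt template and field names are assumptions, not the papers' exact fine-tuning format.
```python
# Illustrative sketch of "question alignment" data: X -> English
# question-translation pairs formatted for supervised fine-tuning.
# The prompt template and field names are assumptions.

def make_alignment_examples(parallel_questions):
    """parallel_questions: list of (non_english_question, english_question)."""
    examples = []
    for src, tgt in parallel_questions:
        examples.append({
            "prompt": f"Translate the following question into English:\n{src}",
            "completion": tgt,
        })
    return examples

# Toy parallel data (Spanish and Turkish questions with English references).
pairs = [
    ("¿Cuántos lados tiene un hexágono?", "How many sides does a hexagon have?"),
    ("Bir yılda kaç gün vardır?", "How many days are there in a year?"),
]
for ex in make_alignment_examples(pairs):
    print(ex["prompt"], "=>", ex["completion"])
```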
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- KoBBQ: Korean Bias Benchmark for Question Answering [28.091808407408823]
The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs).
We present KoBBQ, a Korean bias benchmark dataset.
We propose a general framework that addresses considerations for cultural adaptation of a dataset.
arXiv Detail & Related papers (2023-07-31T15:44:15Z)
- How Different Is Stereotypical Bias Across Languages? [1.0467550794914122]
Recent studies have demonstrated how to assess the stereotypical bias in pre-trained English language models.
We make use of the English StereoSet data set (Nadeem et al., 2021), which we semi-automatically translate into German, French, Spanish, and Turkish.
The main takeaways from our analysis are that mGPT-2 shows surprising anti-stereotypical behavior across languages, that English (monolingual) models exhibit the strongest bias, and that the stereotypes reflected in the dataset are least present in Turkish models.
arXiv Detail & Related papers (2023-07-14T13:17:11Z)
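A sketch of the kind of preference test such StereoSet-style evaluations rely on: compare a causal LM's log-likelihood for a stereotypical sentence against its anti-stereotypical counterpart. The model name (`gpt2`) and the sentence pair are placeholders, not the paper's actual materials.
```python
# Sketch: testing which of two minimally different sentences a causal LM
# prefers, in the spirit of StereoSet-style bias evaluation. "gpt2" and
# the sentence pair below are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens for a total log-prob.
    return -out.loss.item() * (ids.size(1) - 1)

stereo = "The engineer fixed the machine because he was skilled."
anti = "The engineer fixed the machine because she was skilled."
print("prefers stereotypical variant:",
      sentence_logprob(stereo) > sentence_logprob(anti))
```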
- Language-Agnostic Bias Detection in Language Models with Bias Probing [22.695872707061078]
Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases.
We propose a bias probing technique called LABDet for evaluating social bias in PLMs with a robust and language-agnostic method.
We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context.
arXiv Detail & Related papers (2023-05-22T17:58:01Z)
- Gender Bias in Masked Language Models for Multiple Languages [31.528949172210233]
We propose the Multilingual Bias Evaluation (MBE) score, which evaluates bias in various languages using only English attribute word lists and parallel corpora.
We evaluate bias in eight languages using the MBE score and confirm that gender-related biases are encoded in attribute words for all those languages.
arXiv Detail & Related papers (2022-05-01T20:19:14Z)
- MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages [4.433842217026879]
Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance.
arXiv Detail & Related papers (2022-04-12T13:52:54Z)
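The MuCoT-style augmentation step can be sketched as follows. `translate` is a stand-in for any MT system (the paper uses both translation and transliteration), and the record fields and language codes are illustrative.
```python
# Sketch of MuCoT-style augmentation: enlarge a low-resource QA training
# set by translating each sample into related languages before fine-tuning
# an mBERT-based QA model. `translate` is a placeholder; swap in a real MT
# system (e.g., a MarianMT model). Field names and codes are illustrative.

def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: tag the text instead of actually translating it.
    return f"[{src}->{tgt}] {text}"

def augment_qa(sample: dict, src_lang: str, target_langs: list[str]) -> list[dict]:
    """sample: dict with 'question', 'context', 'answer' keys in src_lang."""
    augmented = [dict(sample, lang=src_lang)]
    for lang in target_langs:
        augmented.append({
            "question": translate(sample["question"], src_lang, lang),
            "context": translate(sample["context"], src_lang, lang),
            "answer": translate(sample["answer"], src_lang, lang),
            "lang": lang,
        })
    return augmented

sample = {"question": "...", "context": "...", "answer": "..."}
print(len(augment_qa(sample, "hi", ["bn", "mr"])))  # original + 2 augmented
```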