On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
- URL: http://arxiv.org/abs/2501.04662v1
- Date: Wed, 08 Jan 2025 18:15:47 GMT
- Title: On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
- Authors: Tarek Naous, Wei Xu,
- Abstract summary: This paper aims to uncover the origins of entity-related cultural biases in Language Models (LMs)<n>We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities.<n>Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic.
- Score: 10.263201685476492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2
Related papers
- CARE: Aligning Language Models for Regional Cultural Awareness [28.676469530858924]
Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge.
Previous attempts to address this rely on synthetic data and express cultural knowledge only in English.
We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures.
arXiv Detail & Related papers (2025-04-07T14:57:06Z) - Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking [12.078532717928185]
Large language models (LLMs) continue to exhibit biases toward Western, Anglo-centric, or American cultures.
We introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs.
We find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations.
arXiv Detail & Related papers (2025-02-28T22:28:00Z) - Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world.<n>One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.<n>Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z) - AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.<n>First-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
arXiv Detail & Related papers (2024-09-17T17:59:25Z) - Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, i.e., be crosslingual?
This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - See It from My Perspective: How Language Affects Cultural Bias in Image Understanding [60.70852566256668]
Vision-language models (VLMs) can respond to queries about images in many languages.
We characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity.
arXiv Detail & Related papers (2024-06-17T15:49:51Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - Cross-Lingual Knowledge Editing in Large Language Models [73.12622532088564]
Knowledge editing has been shown to adapt large language models to new knowledge without retraining from scratch.
It is still unknown the effect of source language editing on a different target language.
We first collect a large-scale cross-lingual synthetic dataset by translating ZsRE from English to Chinese.
arXiv Detail & Related papers (2023-09-16T11:07:52Z) - Having Beer after Prayer? Measuring Cultural Bias in Large Language Models [25.722262209465846]
We show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture.
We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures.
Using CAMeL, we examine the cross-cultural performance in Arabic of 16 different LMs on tasks such as story generation, NER, and sentiment analysis.
arXiv Detail & Related papers (2023-05-23T18:27:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.