Related papers: On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

URL: http://arxiv.org/abs/2501.04662v1
Date: Wed, 08 Jan 2025 18:15:47 GMT
Title: On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Authors: Tarek Naous, Wei Xu,
Abstract summary: This paper aims to uncover the origins of entity-related cultural biases in Language Models (LMs)<n>We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities.<n>Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic.
Score: 10.263201685476492
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2

Related papers

Do You Know About My Nation? Investigating Multilingual Language Models' Cultural Literacy Through Factual Knowledge [68.6805229085352]
Most multilingual question-answering benchmarks do not factor in regional diversity in the information they capture.<n>XNationQA encompasses a total of 49,280 questions on the geography, culture, and history of nine countries, presented in seven languages.<n>We benchmark eight standard multilingual LLMs on XNationQA and evaluate them using two novel transference metrics.
arXiv Detail & Related papers (2025-11-01T18:41:34Z)
Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages [46.3747338016989]
We introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures.<n>We evaluate cultural biases in four recent multilingual Large Language Models across various tasks such as cultural context adaptation, sentiment association, and entity extractive QA.<n>Our analyses show a struggle by LLMs at cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally-relevant data.
arXiv Detail & Related papers (2025-10-06T18:59:11Z)
NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities [28.926075586175173]
Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area.<n>Current research directions rely on synthetic data generated by translating English corpora.<n>This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community.
arXiv Detail & Related papers (2025-05-23T21:18:40Z)
CARE: Aligning Language Models for Regional Cultural Awareness [28.676469530858924]
Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge. Previous attempts to address this rely on synthetic data and express cultural knowledge only in English. We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures.
arXiv Detail & Related papers (2025-04-07T14:57:06Z)
Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking [12.078532717928185]
Large language models (LLMs) continue to exhibit biases toward Western, Anglo-centric, or American cultures. We introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs. We find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations.
arXiv Detail & Related papers (2025-02-28T22:28:00Z)
Arabizi vs LLMs: Can the Genie Understand the Language of Aladdin? [0.4751886527142778]
Arabizi is a hybrid form of Arabic that incorporates Latin characters and numbers.<n>It poses significant challenges for machine translation due to its lack of formal structure.<n>This research project investigates the model's performance in translating Arabizi into both Modern Standard Arabic and English.
arXiv Detail & Related papers (2025-02-28T11:37:52Z)
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world.<n>One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.<n>Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.<n>First-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
See It from My Perspective: How Language Affects Cultural Bias in Image Understanding [60.70852566256668]
Vision-language models (VLMs) can respond to queries about images in many languages. We characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
Cross-Lingual Knowledge Editing in Large Language Models [73.12622532088564]
Knowledge editing has been shown to adapt large language models to new knowledge without retraining from scratch. It is still unknown the effect of source language editing on a different target language. We first collect a large-scale cross-lingual synthetic dataset by translating ZsRE from English to Chinese.
arXiv Detail & Related papers (2023-09-16T11:07:52Z)
Having Beer after Prayer? Measuring Cultural Bias in Large Language Models [25.722262209465846]
We show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures. Using CAMeL, we examine the cross-cultural performance in Arabic of 16 different LMs on tasks such as story generation, NER, and sentiment analysis.
arXiv Detail & Related papers (2023-05-23T18:27:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.