Related papers: Disentangling Language and Culture for Evaluating Multilingual Large Language Models

Disentangling Language and Culture for Evaluating Multilingual Large Language Models

URL: http://arxiv.org/abs/2505.24635v1
Date: Fri, 30 May 2025 14:25:45 GMT
Title: Disentangling Language and Culture for Evaluating Multilingual Large Language Models
Authors: Jiahao Ying, Wei Tang, Yiran Zhao, Yixin Cao, Yu Rong, Wenxuan Zhang,
Abstract summary: This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs.<n>By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
Score: 48.06219053598005
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable "CulturalLinguistic Synergy" phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons are activated in a language's cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically model evaluations. Our code can be found at https://yingjiahao14. github.io/Dual-Evaluation/.

Related papers

I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs [5.060243371992739]
We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs)<n> Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries.<n>Our analysis reveals three critical phenomena: "Cross-Lingual Value Shifts" where identical questions yield drastically different responses based on language, "Reasoning-Induced Degradation" where prompting models to explain their reasoning worsens cultural alignment, and "Logit Leakage" where models refuse sensitive questions while internal probabilities reveal strong hidden
arXiv Detail & Related papers (2025-10-15T05:10:57Z)
MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints [7.822567458977689]
MyCulture is a benchmark designed to comprehensively evaluate Large Language Models (LLMs) on Malaysian culture.<n>Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options.<n>We analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations.
arXiv Detail & Related papers (2025-08-07T14:17:43Z)
A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs [0.6494933736121663]
Large language models (LLMs) are increasingly deployed across diverse linguistic and cultural contexts.<n>This paper defines two types of bias in LLMs: model bias (bias stemming from model training) and inference bias (bias induced by the language of the query)<n>We construct a manually curated dataset spanning both factual and disputable QA, across four languages and question types.
arXiv Detail & Related papers (2025-06-27T03:37:15Z)
MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs [26.806566827956875]
MAKIEval is an automatic multilingual framework for evaluating cultural awareness in large language models.<n>It automatically identifies cultural entities in model outputs and links them to structured knowledge.<n>We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems.
arXiv Detail & Related papers (2025-05-27T19:29:40Z)
Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From [61.63091726904068]
We evaluate the cross-lingual context retrieval ability of over 40 large language models (LLMs) across 12 languages.<n>Several small, post-trained open LLMs show strong cross-lingual context retrieval ability.<n>Our results also indicate that larger-scale pretraining cannot improve the xMRC performance.
arXiv Detail & Related papers (2025-04-15T06:35:27Z)
XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [59.549015333755186]
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications.<n>Existing evaluations lack fine-grained constraint analysis across diverse linguistic contexts.<n>We introduce XIFBench, a comprehensive benchmark for evaluating multilingual instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-03-10T17:07:52Z)
Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models [3.9532244541907793]
Theory of Mind (ToM) refers to the cognitive ability to infer and attribute mental states to oneself and others. It remains unclear to what extent large language models (LLMs) demonstrate ToM across diverse languages and cultural contexts. This paper introduces a comprehensive study of multilingual ToM capabilities aimed at addressing this gap.
arXiv Detail & Related papers (2024-11-24T22:37:59Z)
Large Language Models as Neurolinguistic Subjects: Discrepancy in Performance and Competence for Form and Meaning [49.60849499134362]
This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning)<n>We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers.<n>We found: (1) Psycholinguistic and neurolinguistic methods reveal that language performance and competence are distinct; (2) Direct probability measurement may not accurately assess linguistic competence; and (3) Instruction tuning won't change much competence but improve performance.
arXiv Detail & Related papers (2024-11-12T04:16:44Z)
Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models [22.859955360764275]
We introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test to assess a model's ability to retrieve relevant information. We evaluate four state-of-the-art large language models on MLNeedle.
arXiv Detail & Related papers (2024-08-19T17:02:06Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs) We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.