Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study
- URL: http://arxiv.org/abs/2602.06371v1
- Date: Fri, 06 Feb 2026 03:57:21 GMT
- Title: Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study
- Authors: Ju-Chun Ko
- Abstract summary: Large Language Models (LLMs) are increasingly deployed in multilingual contexts, yet their consistency across languages on politically sensitive topics remains understudied. This paper presents a systematic benchmark study examining how 17 LLMs respond to questions concerning the sovereignty of the Republic of China (Taiwan) when queried in Chinese versus English. We discover significant language bias -- the phenomenon where the same model produces substantively different political stances depending on the query language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed in multilingual contexts, yet their consistency across languages on politically sensitive topics remains understudied. This paper presents a systematic bilingual benchmark study examining how 17 LLMs respond to questions concerning the sovereignty of the Republic of China (Taiwan) when queried in Chinese versus English. We discover significant language bias -- the phenomenon where the same model produces substantively different political stances depending on the query language. Our findings reveal that 15 out of 17 tested models exhibit measurable language bias, with Chinese-origin models showing particularly severe issues including complete refusal to answer or explicit propagation of Chinese Communist Party (CCP) narratives. Notably, only GPT-4o Mini achieves a perfect 10/10 score in both languages. We propose novel metrics for quantifying language bias and consistency, including the Language Bias Score (LBS) and Quality-Adjusted Consistency (QAC). Our benchmark and evaluation framework are open-sourced to enable reproducibility and community extension.
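The abstract does not give the definitions of the Language Bias Score (LBS) or Quality-Adjusted Consistency (QAC), so the sketch below is only an illustration under assumed formulas: responses in each language are rubric-scored 0-10, LBS is taken as the absolute gap between the English and Chinese scores, and QAC multiplies a gap-based consistency term by the weaker language's quality. All names and formulas here are assumptions, not the paper's actual metrics.

```python
# Hypothetical sketch of the bilingual evaluation described in the abstract.
# The LBS and QAC formulas below are assumed for illustration; the paper's
# exact definitions may differ.
from dataclasses import dataclass


@dataclass
class BilingualResult:
    model: str
    score_en: float  # 0-10 rubric score for the English-language response
    score_zh: float  # 0-10 rubric score for the Chinese-language response


def language_bias_score(r: BilingualResult) -> float:
    """Assumed LBS: absolute gap between per-language scores (0 = no bias)."""
    return abs(r.score_en - r.score_zh)


def quality_adjusted_consistency(r: BilingualResult, max_score: float = 10.0) -> float:
    """Assumed QAC: consistency discounted by the weaker language's quality,
    so a model that is consistently bad does not score as well as one that is
    consistently good."""
    consistency = 1.0 - abs(r.score_en - r.score_zh) / max_score
    quality = min(r.score_en, r.score_zh) / max_score
    return consistency * quality


# Example: a model that answers well in English but largely refuses in Chinese.
result = BilingualResult(model="example-model", score_en=9.0, score_zh=2.0)
print(language_bias_score(result))           # 7.0  -> large language bias
print(quality_adjusted_consistency(result))  # 0.06 -> low quality-adjusted consistency
```

Under these assumed definitions, the perfect 10/10 bilingual result reported for GPT-4o Mini would give LBS = 0 and QAC = 1.0, while a refusal in one language drives QAC toward zero.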
Related papers
- Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations [21.050704978484784]
We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite. We evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases.
arXiv Detail & Related papers (2026-02-01T05:34:34Z) - Cross-Language Bias Examination in Large Language Models [37.21579885190632]
This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models. By translating the prompts and word list into five target languages, we compare different types of bias across languages. For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias.
arXiv Detail & Related papers (2025-12-17T23:22:03Z) - Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG [55.258582772528506]
We investigate whether the mixture of different document languages impacts generation and citation in unintended ways. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English. We find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone.
arXiv Detail & Related papers (2025-09-17T12:58:18Z) - Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models [7.480124826347168]
This paper investigates the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics. We collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests (a minimal sketch of one such cross-language test appears after this list).
arXiv Detail & Related papers (2025-05-25T12:25:44Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs [50.07451351559251]
We present a study across five typologically distinct languages (English, Russian, German, Hindi, and Vietnamese). We examine how position bias interacts with prompt strategies and affects output entropy.
arXiv Detail & Related papers (2025-05-22T02:23:00Z) - Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions [2.8202443616982884]
This study systematically analyzes geopolitical bias across 11 prominent Large Language Models (LLMs). We generated 19,712 prompts designed to detect ideological leanings in model outputs. U.S.-based models predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited pronounced Pro-China biases.
arXiv Detail & Related papers (2025-03-31T03:38:17Z) - Assessing Agentic Large Language Models in Multilingual National Bias [31.67058518564021]
Cross-language disparities in reasoning-based recommendations remain largely unexplored. This study is the first to address this gap. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages.
arXiv Detail & Related papers (2025-02-25T08:07:42Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z) - Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
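The MSQAD entry above mentions comparing responses across languages and topics with two statistical hypothesis tests, but the summary does not name them. The sketch below uses a chi-square test of independence on response-category counts as one plausible instance of such a cross-language comparison; the categories and counts are invented for illustration.

```python
# Hypothetical cross-language comparison in the spirit of the MSQAD entry above:
# test whether the distribution of response categories (answer / hedge / refuse)
# is independent of the query language. All counts are invented for illustration.
from scipy.stats import chi2_contingency

# Rows: query language; columns: counts of response categories.
observed = [
    [120, 25, 5],   # English: answer, hedge, refuse
    [70, 40, 40],   # Chinese: answer, hedge, refuse
]

chi2, p_value, dof, _expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")
if p_value < 0.05:
    print("Response behaviour differs significantly across the two languages.")
else:
    print("No significant cross-language difference detected.")
```

The same contingency-table approach extends naturally to more languages or finer-grained stance labels; only the shape of the observed-count table changes.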
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.