Related papers: Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge

URL: http://arxiv.org/abs/2601.13649v1
Date: Tue, 20 Jan 2026 06:33:33 GMT
Title: Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge
Authors: Xiaolin Zhou, Zheng Luo, Yicheng Gao, Qixuan Chen, Xiyang Hu, Yue Zhao, Ruishan Liu
Abstract summary: We study two types of language bias in pairwise LLM-as-a-judge. For same-language judging, there exist significant performance disparities across language families, with European languages consistently outperforming African languages. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by answer language than question language.
Score: 9.062065949101001
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Large Language Models (LLMs) have incentivized the development of LLM-as-a-judge, an application of LLMs where they are used as judges to decide the quality of a certain piece of text given a certain context. However, previous studies have demonstrated that LLM-as-a-judge can be biased towards different aspects of the judged texts, which often do not align with human preference. One of the identified biases is language bias, which indicates that the decision of LLM-as-a-judge can differ based on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity between languages when the judge is prompted to compare options from the same language, and (2) bias towards options written in major languages when the judge is prompted to compare options of two different languages. We find that for same-language judging, there exist significant performance disparities across language families, with European languages consistently outperforming African languages, and this bias is more pronounced in culturally-related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by answer language than question language. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and we find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity only.

Related papers

Cross-Language Bias Examination in Large Language Models [37.21579885190632]
This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models. By translating the prompts and word list into five target languages, we compare different types of bias across languages. For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias.
arXiv Detail & Related papers (2025-12-17T23:22:03Z)
Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models [7.480124826347168]
This paper investigates the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics. We collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests.
arXiv Detail & Related papers (2025-05-25T12:25:44Z)
Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z)
Multilingual Relative Clause Attachment Ambiguity Resolution in Large Language Models [2.3749120526936465]
Large language models (LLMs) resolve relative clause (RC) attachment ambiguities. We assess whether LLMs can achieve human-like interpretations amid the complexities of language. We evaluate models in English, Spanish, French, German, Japanese, and Korean.
arXiv Detail & Related papers (2025-03-04T19:56:56Z)
Assessing Agentic Large Language Models in Multilingual National Bias [31.67058518564021]
Cross-language disparities in reasoning-based recommendations remain largely unexplored. This study is the first to address this gap. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages.
arXiv Detail & Related papers (2025-02-25T08:07:42Z)
Can Language Models Learn Typologically Implausible Languages? [62.823015163987996]
Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans. We discuss how language models (LMs) allow us to better determine the role of domain-general learning biases in language universals. We test LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages.
arXiv Detail & Related papers (2025-02-17T20:40:01Z)
Truth Knows No Language: Evaluating Truthfulness Beyond English [11.20320645651082]
We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring.
arXiv Detail & Related papers (2025-02-13T15:04:53Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency in Large Language Model (LLM)-generated content. We introduce the Language Agency Bias Evaluation benchmark, which comprehensively evaluates biases in LLMs. Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral.
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
Comparing Biases and the Impact of Multilingual Training across Multiple Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task. We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender. Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)
Counterfactual VQA: A Cause-Effect Look at Language Bias [117.84189187160005]
VQA models tend to rely on language bias as a shortcut and fail to sufficiently learn the multi-modal knowledge from both vision and language. We propose a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers.
arXiv Detail & Related papers (2020-06-08T01:49:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.