Benchmarking Gender and Political Bias in Large Language Models
- URL: http://arxiv.org/abs/2509.06164v2
- Date: Tue, 16 Sep 2025 05:28:09 GMT
- Title: Benchmarking Gender and Political Bias in Large Language Models
- Authors: Jinrui Yang, Xudong Han, Timothy Baldwin
- Abstract summary: We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP). Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias.
- Score: 37.192287982246526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.
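To make the vote-prediction evaluation concrete, below is a minimal sketch of how accuracy could be disaggregated by MEP gender once the dataset is in hand. The file name and column names (vote, gender, prediction) are assumptions for illustration, not the released EuroParlVote schema.

```python
# Minimal sketch: disaggregating vote-prediction accuracy by MEP gender.
# File and column names are hypothetical; consult the released dataset.
import pandas as pd

df = pd.read_csv("europarlvote.csv")  # hypothetical file name

# "prediction" holds an LLM's simulated vote; "vote" the roll-call outcome.
df["correct"] = df["prediction"] == df["vote"]

# A gap between the groups' mean accuracies is the kind of disparity the
# paper reports (lower accuracy when simulating votes for female speakers).
print(df.groupby("gender")["correct"].mean())
```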
Related papers
- Uncovering Political Bias in Large Language Models using Parliamentary Voting Records [2.272052150526026]
This paper introduces a general methodology for constructing political bias benchmarks. We instantiate this methodology in three national case studies. We assess ideological tendencies and political entity bias in LLM behavior.
arXiv Detail & Related papers (2026-01-13T18:18:25Z)
- Gender and Political Bias in Large Language Models: A Demonstration Platform [12.223144746389371]
ParlAI Vote is an interactive system for exploring European Parliament debates and votes. It includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, and compare real voting outcomes with predictions from frontier LLMs.
arXiv Detail & Related papers (2025-09-18T04:34:33Z)
- Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models [72.89977583150748]
We propose a novel methodology to assess how Large Language Models align with broader geopolitical value systems. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin.
arXiv Detail & Related papers (2025-06-15T07:52:07Z)
- Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models [3.217354895187878]
We analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions. We find that we can simulate the voting behavior of Members of the European Parliament reasonably well, with a weighted F1 score of approximately 0.793.
arXiv Detail & Related papers (2025-06-13T14:02:21Z)
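As context for the weighted F1 score reported in the persona-prompting entry above, here is a minimal sketch of how such a score is computed with scikit-learn. The vote labels are illustrative placeholders, not the paper's data.

```python
# Minimal sketch: weighted F1 for simulated parliamentary vote predictions.
# The labels below are made-up stand-ins, not data from the paper.
from sklearn.metrics import f1_score

true_votes = ["for", "against", "for", "abstain", "for", "against"]
pred_votes = ["for", "against", "against", "abstain", "for", "for"]

# average="weighted" computes per-class F1, then averages the scores
# weighted by class support, so majority classes dominate the result.
print(f1_score(true_votes, pred_votes, average="weighted"))
```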
- GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models [73.23743278545321]
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases. GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z)
- Representation Bias in Political Sample Simulations with Large Language Models [54.48283690603358]
This study seeks to identify and quantify biases in simulating political samples with Large Language Models.
Using the GPT-3.5-Turbo model, we leverage data from the American National Election Studies, German Longitudinal Election Study, Zuobiao dataset, and China Family Panel Studies.
arXiv Detail & Related papers (2024-07-16T05:52:26Z)
- GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models (LVLMs).
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z)
- Assessing Political Bias in Large Language Models [0.624709220163167]
We evaluate the political bias of open-source Large Language Models (LLMs) concerning political issues within the European Union (EU) from a German voter's perspective.
We show that larger models, such as Llama3-70B, tend to align more closely with left-leaning political parties, while smaller models often remain neutral.
arXiv Detail & Related papers (2024-05-17T15:30:18Z)
- Llama meets EU: Investigating the European Political Spectrum through the Lens of LLMs [18.836470390824633]
We audit Llama Chat in the context of EU politics to analyze the model's political knowledge and its ability to reason in context.
We adapt (i.e., further fine-tune) Llama Chat on speeches of individual euro-parties from debates in the European Parliament to reevaluate its political leaning.
arXiv Detail & Related papers (2024-03-20T13:42:57Z)
- Whose Side Are You On? Investigating the Political Stance of Large Language Models [56.883423489203786]
We investigate the political orientation of Large Language Models (LLMs) across a spectrum of eight polarizing topics, spanning from abortion to LGBTQ issues.
The findings suggest that users should be mindful when crafting queries, and exercise caution in selecting neutral prompt language.
arXiv Detail & Related papers (2024-03-15T04:02:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.