Steering Towards Fairness: Mitigating Political Bias in LLMs
- URL: http://arxiv.org/abs/2508.08846v3
- Date: Sat, 20 Sep 2025 07:24:55 GMT
- Title: Steering Towards Fairness: Mitigating Political Bias in LLMs
- Authors: Afrozah Nadeem, Mark Dras, Usman Naseem,
- Abstract summary: We employ a framework for probing and mitigating such biases in large language models (LLMs) through analysis of internal model representations.<n>We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes.<n>Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation.
- Score: 16.594400974742523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases along political and economic dimensions. In this paper, we employ a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), this method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs.<n>MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation.<n>Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z) - Policy-Conditioned Policies for Multi-Agent Task Solving [53.67744322553693]
In this work, we propose a paradigm shift that bridges the gap by representing policies as human-interpretable source code.<n>We reformulate the learning problem by utilizing Large Language Models (LLMs) as approximate interpreters.<n>We formalize this process as textitProgrammatic Iterated Best Response (PIBR), an algorithm where the policy code is optimized by textual gradients.
arXiv Detail & Related papers (2025-12-24T07:42:10Z) - Fine-Grained Interpretation of Political Opinions in Large Language Models [19.21833592916603]
Recent work indicates that there is a misalignment between LLMs' responses and their internal intentions.<n>This motivates us to probe LLMs' internal mechanisms and help uncover their internal political states.<n>We designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning.
arXiv Detail & Related papers (2025-06-05T09:06:59Z) - Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification [4.352835414206441]
Political biases encoded by LLMs might have detrimental effects on downstream applications.<n>We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence.<n>We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages.
arXiv Detail & Related papers (2025-05-26T10:01:24Z) - Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge [70.89799989428367]
We conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias.<n>We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge.
arXiv Detail & Related papers (2025-05-26T03:56:41Z) - Relative Bias: A Comparative Framework for Quantifying Bias in LLMs [29.112649816695203]
Relative Bias is a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain.<n>We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively.<n>Applying our framework to several case studies on bias and alignment scenarios following by statistical tests for validation, we find strong alignment between the two scoring methods.
arXiv Detail & Related papers (2025-05-22T01:59:54Z) - A Multi-LLM Debiasing Framework [85.17156744155915]
Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities.
Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning.
We propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs.
arXiv Detail & Related papers (2024-09-20T20:24:50Z) - PoliTune: Analyzing the Impact of Data Selection and Fine-Tuning on Economic and Political Biases in Large Language Models [1.1704154007740835]
This work investigates the impact of fine-tuning and data selection on economic and political biases in Large Language Models (LLMs)
We introduce PoliTune, a fine-tuning methodology to explore the systematic aspects of aligning LLMs with specific ideologies.
We introduce a systematic method for using the open-source Llama3-70B for dataset selection, annotation, and synthesizing a preferences dataset for Direct Preference Optimization (DPO) to align the model with a given political ideology.
arXiv Detail & Related papers (2024-04-10T16:30:09Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive [53.08398658452411]
Large Language Models (LLMs) are increasingly utilized in autonomous decision-making.<n>We show that this sampling behavior resembles that of human decision-making.<n>We show that this deviation of a sample from the statistical norm towards a prescriptive component consistently appears in concepts across diverse real-world domains.
arXiv Detail & Related papers (2024-02-16T18:28:43Z) - Sparsity-Guided Holistic Explanation for LLMs with Interpretable
Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z) - Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mitigating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.