Related papers: Don't Change My View: Ideological Bias Auditing in Large Language Models

Don't Change My View: Ideological Bias Auditing in Large Language Models

URL: http://arxiv.org/abs/2509.12652v1
Date: Tue, 16 Sep 2025 04:14:29 GMT
Title: Don't Change My View: Ideological Bias Auditing in Large Language Models
Authors: Paul Kröger, Emilio Barkett,
Abstract summary: We adapt a previously proposed statistical method to the new context of ideological bias auditing.<n>We analyze distributional shifts in model outputs across prompts that are thematically related to a chosen topic.<n>This design makes the method particularly suitable for auditing proprietary black-box systems.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.

Related papers

Beyond the Surface: Probing the Ideological Depth of Large Language Models [3.84754844062131]
This paper investigates the concept of "ideological depth" in Large Language Models (LLMs)<n>We measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering.<n>Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features.
arXiv Detail & Related papers (2025-08-29T09:27:01Z)
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models [9.05950721565821]
We study strategic deception in large language models (LLMs)<n>We induce, detect, and control such deception in CoT-enabled LLMs.<n>We achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts.
arXiv Detail & Related papers (2025-06-05T11:44:19Z)
Probing the Subtle Ideological Manipulation of Large Language Models [0.3745329282477067]
Large Language Models (LLMs) have transformed natural language processing, but concerns have emerged about their susceptibility to ideological manipulation.<n>We introduce a novel multi-task dataset designed to reflect diverse ideological positions through tasks such as ideological QA, statement ranking, manifesto cloze completion, and Congress bill comprehension.<n>Our findings indicate that fine-tuning significantly enhances nuanced ideological alignment, while explicit prompts provide only minor refinements.
arXiv Detail & Related papers (2025-04-19T13:11:50Z)
Cognitive Debiasing Large Language Models for Decision-Making [71.2409973056137]
Large language models (LLMs) have shown potential in supporting decision-making applications.<n>We propose a cognitive debiasing approach, self-adaptive cognitive debiasing (SACD)<n>Our method follows three sequential steps -- bias determination, bias analysis, and cognitive debiasing -- to iteratively mitigate potential cognitive biases in prompts.
arXiv Detail & Related papers (2025-04-05T11:23:05Z)
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request.<n>We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions.<n>We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
arXiv Detail & Related papers (2025-02-24T18:52:59Z)
PRISM: A Methodology for Auditing Biases in Large Language Models [9.751718230639376]
PRISM is a flexible, inquiry-based methodology for auditing Large Language Models. It seeks to illicit such positions indirectly through task-based inquiry prompting rather than direct inquiry of said preferences.
arXiv Detail & Related papers (2024-10-24T16:57:20Z)
Whose Side Are You On? Investigating the Political Stance of Large Language Models [56.883423489203786]
We investigate the political orientation of Large Language Models (LLMs) across a spectrum of eight polarizing topics. Our investigation delves into the political alignment of LLMs across a spectrum of eight polarizing topics, spanning from abortion to LGBTQ issues. The findings suggest that users should be mindful when crafting queries, and exercise caution in selecting neutral prompt language.
arXiv Detail & Related papers (2024-03-15T04:02:24Z)
Prompting Fairness: Integrating Causality to Debias Large Language Models [19.76215433424235]
Large language models (LLMs) are susceptible to generating biased and discriminatory responses.<n>We propose a causality-guided debiasing framework to tackle social biases.
arXiv Detail & Related papers (2024-03-13T17:46:28Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation. We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process. We find that the observed disparate treatment can at least in part be attributed to confounding and mitigating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented. Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.