What Is The Political Content in LLMs' Pre- and Post-Training Data?
- URL: http://arxiv.org/abs/2509.22367v1
- Date: Fri, 26 Sep 2025 14:00:51 GMT
- Title: What Is The Political Content in LLMs' Pre- and Post-Training Data?
- Authors: Tanise Ceron, Dmitry Nikolaev, Dominik Stammbach, Debora Nozza,
- Abstract summary: We present an analysis of the pre- and post-training corpora of OLMO2, the largest fully open-source model released together with its complete dataset. From these corpora, we draw large random samples, automatically annotate documents for political orientation, and analyze their source domains and content. We then assess how political content in the training data correlates with models' stance on specific policy issues.
- Score: 12.72257058961811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are known to generate politically biased text, yet how such biases arise remains unclear. A crucial step toward answering this question is the analysis of training data, whose political content remains largely underexplored in current LLM research. To address this gap, we present in this paper an analysis of the pre- and post-training corpora of OLMO2, the largest fully open-source model released together with its complete dataset. From these corpora, we draw large random samples, automatically annotate documents for political orientation, and analyze their source domains and content. We then assess how political content in the training data correlates with models' stance on specific policy issues. Our analysis shows that left-leaning documents predominate across datasets, with pre-training corpora containing significantly more politically engaged content than post-training data. We also find that left- and right-leaning documents frame similar topics through distinct values and sources of legitimacy. Finally, the predominant stance in the training data strongly correlates with models' political biases when evaluated on policy issues. These findings underscore the need to integrate political content analysis into future data curation pipelines as well as in-depth documentation of filtering strategies for transparency.
Related papers
- Exploiting contextual information to improve stance detection in informal political discourse with LLMs [0.0]
This study investigates the use of Large Language Models (LLMs) for political stance detection in informal online discourse. Using a real-world political forum dataset, we generate structured profiles that summarize users' ideological leaning, recurring topics, and linguistic patterns. We show that contextual prompts significantly boost accuracy, with improvements ranging from +17.5% to +38.5%, achieving up to 74% accuracy that surpasses previous approaches.
arXiv Detail & Related papers (2026-02-04T16:49:26Z)
- Analyzing Political Text at Scale with Online Tensor LDA [53.16930342547758]
This paper proposes a topic modeling method that scales linearly to billions of documents. We show that this method is computationally and memory efficient (achieving speeds over 3-4x those of prior parallelized Latent Dirichlet Allocation (LDA) methods). We perform two real-world, large-scale new studies of interest to political scientists.
arXiv Detail & Related papers (2025-11-11T03:58:48Z)
- Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases [24.622980403581018]
We investigate the extent to which Large Language Models' political leanings reflect memorized patterns from their pretraining corpora. As a case study, we focus on probing the political leanings of LLMs in 32 US Supreme Court cases, addressing contentious topics such as abortion and voting rights.
arXiv Detail & Related papers (2025-02-25T15:16:17Z)
- The Impact of Persona-based Political Perspectives on Hateful Content Detection [4.04666623219944]
Politically diverse language models require computational resources often inaccessible to many researchers and organizations. Recent work has established that persona-based prompting can introduce political diversity in model outputs without additional training. We investigate whether such prompting strategies can achieve results comparable to political pretraining for downstream tasks.
arXiv Detail & Related papers (2025-02-01T09:53:17Z)
- Political-LLM: Large Language Models in Political Science [159.95299889946637]
Large language models (LLMs) have been widely adopted in political science tasks. Political-LLM aims to advance the comprehensive understanding of integrating LLMs into computational political science.
arXiv Detail & Related papers (2024-12-09T08:47:50Z)
- Balancing Transparency and Accuracy: A Comparative Analysis of Rule-Based and Deep Learning Models in Political Bias Classification [5.550237524713089]
The study highlights the sensitivity of modern self-learning systems to unconstrained data ingestion.
Applying both models to left-leaning (CNN) and right-leaning (FOX) news articles, we assess their effectiveness on data beyond the original training and test sets.
We contrast the opaque architecture of a deep learning model with the transparency of a linguistically informed rule-based model.
arXiv Detail & Related papers (2024-11-07T00:09:18Z)
- Language Models Learn Metadata: Political Stance Detection Case Study [1.2277343096128712]
This paper investigates the optimal way to incorporate metadata into a political stance detection task.
We show that our simple baseline, using only party membership information, surpasses the current state-of-the-art.
arXiv Detail & Related papers (2024-09-15T14:57:41Z)
- Whose Side Are You On? Investigating the Political Stance of Large Language Models [56.883423489203786]
We investigate the political orientation of Large Language Models (LLMs) across a spectrum of eight polarizing topics, spanning from abortion to LGBTQ issues.
The findings suggest that users should be mindful when crafting queries, and exercise caution in selecting neutral prompt language.
arXiv Detail & Related papers (2024-03-15T04:02:24Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mitigating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies [74.01792675564218]
We develop a data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents.
To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models.
Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
arXiv Detail & Related papers (2022-04-19T15:45:23Z)
- PolicyQA: A Reading Comprehension Dataset for Privacy Policies [77.79102359580702]
We present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies.
We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.
arXiv Detail & Related papers (2020-10-06T09:04:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.