Are LLMs Good Safety Agents or a Propaganda Engine?
- URL: http://arxiv.org/abs/2511.23174v1
- Date: Fri, 28 Nov 2025 13:36:00 GMT
- Title: Are LLMs Good Safety Agents or a Propaganda Engine?
- Authors: Neemesh Yadav, Francesco Ortu, Jiarui Liu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Alberto Cazzaniga, Zhijing Jin
- Abstract summary: PSP is a dataset built specifically to probe the refusal behaviors in Large Language Models from an explicitly political context. PSP is built by formatting existing censored content from two data sources, openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and, 2) vulnerability of models on PSP through prompt injection attacks (PIAs).
- Score: 74.88607730071483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this behavior truly reflects their safety policies or instead indicates the kind of political censorship practiced globally by governments are lacking. Differentiating between safety-influenced refusals and politically motivated censorship is difficult. For this purpose, we introduce PSP, a dataset built specifically to probe the refusal behaviors of LLMs in an explicitly political context. PSP is built by formatting existing censored content from two data sources, openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) the impact of political sensitivity in seven LLMs through data-driven approaches (making PSP implicit) and representation-level approaches (erasing the concept of politics); and 2) the vulnerability of models on PSP through prompt injection attacks (PIAs). Associating censorship with refusals on content whose intent is masked and implicit, we find that most LLMs perform some form of censorship. We conclude by summarizing the major attributes that can shift refusal distributions across models and across the contexts of different countries.
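The abstract mentions a representation-level intervention ("erasing the concept of politics") but does not spell out the mechanism. A common instantiation of this idea is linear concept erasure: estimate a "politics" direction from contrastive activations and project it out of the model's hidden states before measuring refusal behavior. The sketch below is a minimal, hypothetical illustration of that technique using mean-difference direction estimation; the function names and the NumPy stand-ins for activations are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def politics_direction(pol_acts: np.ndarray, neu_acts: np.ndarray) -> np.ndarray:
    """Estimate a unit 'politics' direction as the difference of mean activations
    between politically sensitive prompts and neutral prompts (both [n, d])."""
    d = pol_acts.mean(axis=0) - neu_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def erase_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each hidden state's component along `direction`, i.e. project
    onto the hyperplane orthogonal to the concept direction. hidden is [n, d]."""
    return hidden - np.outer(hidden @ direction, direction)

# Illustrative usage with random stand-ins for model activations.
rng = np.random.default_rng(0)
pol_acts = rng.normal(size=(128, 768)) + 0.5  # activations on PSP-style prompts (stand-in)
neu_acts = rng.normal(size=(128, 768))        # activations on neutral prompts (stand-in)

d = politics_direction(pol_acts, neu_acts)
erased = erase_direction(pol_acts, d)
print(float(np.abs(erased @ d).max()))  # ~0: the politics component has been removed
```

In a study like the one described, one would compare refusal rates on PSP prompts before and after such an intervention; how the refusal distribution shifts is what helps separate safety-driven refusals from censorship-like behavior.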
Related papers
- Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings. We introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z) - R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model [17.402774424821814]
Reports suggest R1 refuses to answer certain prompts related to politically sensitive topics in China. We introduce a large-scale set of heavily curated prompts that get censored by R1, but are not censored by other models. We conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context.
arXiv Detail & Related papers (2025-05-19T02:16:56Z) - What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices [46.30336056625582]
This work investigates the extent to which Large Language Models refuse to answer or omit information when prompted on political topics. Our analysis covers 14 state-of-the-art models from Western countries, China, and Russia, prompted in all six official United Nations (UN) languages.
arXiv Detail & Related papers (2025-04-04T09:09:06Z) - Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing [34.69237228285959]
We study content moderation decisions made across countries using pre-existing corpora from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. We assess the effectiveness of using LLMs in content moderation.
arXiv Detail & Related papers (2025-03-07T09:49:31Z) - Large Language Models Reflect the Ideology of their Creators [71.65505524599888]
Large language models (LLMs) are trained on vast amounts of data to generate natural language. This paper shows that the ideological stance of an LLM appears to reflect the worldview of its creators.
arXiv Detail & Related papers (2024-10-24T04:02:30Z) - Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks [18.208272960774337]
Large Language Models (LLMs) have sparked widespread concerns about their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning. We take a further step to understand fine-tuning attacks in multilingual LLMs.
arXiv Detail & Related papers (2024-10-23T18:27:36Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z) - LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? [52.71988102039535]
We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
arXiv Detail & Related papers (2023-07-20T09:25:02Z)