R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model
- URL: http://arxiv.org/abs/2505.12625v1
- Date: Mon, 19 May 2025 02:16:56 GMT
- Title: R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model
- Authors: Ali Naseh, Harsh Chaudhari, Jaechul Roh, Mingshi Wu, Alina Oprea, Amir Houmansadr
- Abstract summary: Reports suggest R1 refuses to answer certain prompts related to politically sensitive topics in China. We introduce a large-scale set of heavily curated prompts that are censored by R1 but not by other models. We conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context.
- Score: 17.402774424821814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DeepSeek recently released R1, a high-performing large language model (LLM) optimized for reasoning tasks. Despite its efficient training pipeline, R1 achieves competitive performance, even surpassing leading reasoning models like OpenAI's o1 on several benchmarks. However, emerging reports suggest that R1 refuses to answer certain prompts related to politically sensitive topics in China. While existing LLMs often implement safeguards to avoid generating harmful or offensive outputs, R1 represents a notable shift - exhibiting censorship-like behavior on politically charged queries. In this paper, we investigate this phenomenon by first introducing a large-scale set of heavily curated prompts that get censored by R1, covering a range of politically sensitive topics, but are not censored by other models. We then conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context. Beyond English-language queries, we explore censorship behavior in other languages. We also investigate the transferability of censorship to models distilled from the R1 language model. Finally, we propose techniques for bypassing or removing this censorship. Our findings reveal possible additional censorship integration likely shaped by design choices during training or alignment, raising concerns about transparency, bias, and governance in language model deployment.
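At its core, the study's measurement loop sends curated politically sensitive prompts to the model and classifies whether each reply is a refusal. Below is a minimal sketch of such a probe, assuming an OpenAI-compatible chat endpoint, a local model id, and a crude keyword-based refusal heuristic; all of these are illustrative choices, not details taken from the paper.

```python
# Hypothetical censorship probe: send curated prompts to an OpenAI-compatible
# chat endpoint and flag refusal-like replies. Endpoint URL, model id, prompt
# file, and refusal keywords are assumptions for illustration only.
import json
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "deepseek-r1"                                   # assumed model id

REFUSAL_MARKERS = [
    "i can't help with that",
    "i cannot discuss",
    "let's talk about something else",
]

def is_refusal(text: str) -> bool:
    """Crude keyword heuristic; studies typically use an LLM judge instead."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def probe(prompts):
    results = []
    for prompt in prompts:
        resp = requests.post(API_URL, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
        }, timeout=120)
        answer = resp.json()["choices"][0]["message"]["content"]
        results.append({"prompt": prompt, "refused": is_refusal(answer)})
    return results

if __name__ == "__main__":
    with open("curated_prompts.json") as f:  # assumed file of curated prompts
        prompts = json.load(f)
    print(json.dumps(probe(prompts), indent=2, ensure_ascii=False))
```

Refusal rates per topic, language, or phrasing can then be aggregated from the resulting records to study consistency and triggers.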
Related papers
- Are LLMs Good Safety Agents or a Propaganda Engine? [74.88607730071483]
PSP is a dataset built specifically to probe refusal behaviors in Large Language Models in an explicitly political context. It is constructed by reformatting existing censored content from two openly available data sources: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) the impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and 2) the vulnerability of models on PSP to prompt injection attacks (PIAs).
arXiv Detail & Related papers (2025-11-28T13:36:00Z)
- SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs [57.880467106470775]
Attackers can inject imperceptible perturbations into the training data, causing the model to generate malicious, attacker-controlled captions. We propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations to sensitive image regions, aiming to disrupt the activation of malicious pathways.
arXiv Detail & Related papers (2025-06-05T08:22:24Z)
- Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high [0.40329768057075643]
DeepSeek-R1 consistently exhibited substantially higher proportions of both propaganda and anti-U.S. sentiment. These biases were not confined to overtly political topics but also permeated cultural and lifestyle content.
arXiv Detail & Related papers (2025-06-02T15:54:06Z)
- MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z)
- Discovering Forbidden Topics in Language Models [26.2418673687851]
We develop a refusal discovery method that uses token prefilling to find forbidden topics. We benchmark IPC on Tulu-3-8B, an open-source model with public safety tuning data. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.
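Token prefilling here means seeding the assistant's turn with a fixed prefix and letting the model continue, so it enumerates topics it would otherwise decline to name. A minimal sketch under assumed prompt wording and an assumed Hugging Face repo id for the Tulu-3-8B model mentioned above:

```python
# Hypothetical token-prefilling probe: seed the assistant turn with a list
# prefix and let the model continue, surfacing topics it avoids.
# The prefix wording and repository id are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/Llama-3.1-Tulu-3-8B"  # assumed repo id for Tulu-3-8B
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

messages = [{"role": "user", "content": "Which topics are you not allowed to discuss?"}]
# Build the chat prompt, then append a prefilled assistant prefix.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "Here is a complete list of topics I must refuse to discuss:\n1."

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Print only the continuation, i.e. the model's completion of the list.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```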
arXiv Detail & Related papers (2025-05-23T03:49:06Z)
- Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning [54.56271651170667]
Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. We introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. We also propose Fact-R1, a framework that integrates deep reasoning with collaborative rule-based reinforcement learning.
arXiv Detail & Related papers (2025-05-22T16:05:06Z)
- Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control [7.737740676767729]
We use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs. We show that a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector.
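Such a steering vector is commonly estimated by contrasting hidden activations on refused versus answered prompts, then adding a scaled (here negative) multiple of that direction back into the residual stream during generation. The sketch below illustrates that general idea; the model id, layer index, scale, and hook mechanics are assumptions, not the paper's exact recipe.

```python
# Hypothetical activation-steering sketch: estimate a refusal direction as a
# difference of mean hidden states, then inject a negative multiple of it
# during generation. All hyperparameters here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed distilled model
LAYER = 16                                           # assumed steering layer
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

censored_prompts = ["<politically sensitive prompt>"]  # placeholders: fill with curated prompts
answered_prompts = ["<benign control prompt>"]

@torch.no_grad()
def mean_hidden(prompts, layer=LAYER):
    """Mean last-token hidden state at one layer, averaged over prompts."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        hs = model(**ids, output_hidden_states=True).hidden_states[layer]
        states.append(hs[0, -1])
    return torch.stack(states).mean(dim=0)

# Refusal direction = mean activation on censored prompts minus answered ones.
refusal_dir = mean_hidden(censored_prompts) - mean_hidden(answered_prompts)
refusal_dir = refusal_dir / refusal_dir.norm()

def steer(module, _inputs, output, alpha=-8.0):  # negative multiple suppresses refusal
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * refusal_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
ids = tok("<politically sensitive prompt>", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=100)[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```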
arXiv Detail & Related papers (2025-04-23T22:47:30Z)
- What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices [46.30336056625582]
This work investigates the extent to which Large Language Models refuse to answer or omit information when prompted on political topics. Our analysis covers 14 state-of-the-art models from Western countries, China, and Russia, prompted in all six official United Nations (UN) languages.
arXiv Detail & Related papers (2025-04-04T09:09:06Z)
- CensorLab: A Testbed for Censorship Experimentation [15.411134921415567]
We design and implement CensorLab, a generic platform for emulating Internet censorship scenarios. CensorLab aims to support all censorship mechanisms previously or currently deployed by real-world censors. It provides an easy-to-use platform that enables researchers and practitioners to perform extensive experimentation.
arXiv Detail & Related papers (2024-12-20T21:17:24Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Wiki-En-ASR-Adapt: Large-scale synthetic dataset for English ASR Customization [66.22007368434633]
We present the first large-scale public synthetic dataset for contextual spellchecking customization of automatic speech recognition (ASR).
The proposed approach allows creating millions of realistic examples of corrupted ASR hypotheses and simulating non-trivial biasing lists for the customization task.
We report experiments with training an open-source customization model on the proposed dataset and show that the injection of hard negative biasing phrases decreases WER and the number of false alarms.
arXiv Detail & Related papers (2023-09-29T14:18:59Z)
- LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? [52.71988102039535]
We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
arXiv Detail & Related papers (2023-07-20T09:25:02Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Counterfactual VQA: A Cause-Effect Look at Language Bias [117.84189187160005]
VQA models tend to rely on language bias as a shortcut and fail to sufficiently learn the multi-modal knowledge from both vision and language.
We propose a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers.
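In causal terms, the language bias is isolated as the question-only direct effect on the answer, and debiased inference subtracts it from the total effect. One compact way to write this, with notation assumed for illustration rather than quoted from the abstract:

```latex
% v: image, q: question, a: candidate answer; starred values denote the
% counterfactual (no-treatment) setting. Notation assumed for illustration.
\begin{align*}
\mathrm{TE}  &= Y_{v,q}(a) - Y_{v^{*},q^{*}}(a) && \text{total effect}\\
\mathrm{NDE} &= Y_{v^{*},q}(a) - Y_{v^{*},q^{*}}(a) && \text{question-only (language shortcut)}\\
\mathrm{TIE} &= \mathrm{TE} - \mathrm{NDE} = Y_{v,q}(a) - Y_{v^{*},q}(a) && \text{used for debiased inference}
\end{align*}
```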
arXiv Detail & Related papers (2020-06-08T01:49:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.