Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
- URL: http://arxiv.org/abs/2504.17130v2
- Date: Sat, 26 Apr 2025 20:59:09 GMT
- Title: Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
- Authors: Hannah Cyberey, David Evans
- Abstract summary: We use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs. We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector.
- Score: 7.737740676767729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector. Our code is publicly available at: https://github.com/hannahxchen/llm-censorship-steering
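The abstract outlines the recipe at a high level: extract a refusal-compliance direction from the model's hidden states and steer along it, with negative multiples removing censorship. The authors' implementation is in the linked repository; the snippet below is only a minimal sketch of the generic difference-of-means steering pattern, with the layer choice, coefficient, and tensor names as illustrative assumptions rather than details from the paper.

```python
import torch

def refusal_compliance_vector(h_refusal: torch.Tensor, h_comply: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between hidden states collected at one layer
    for refused vs. complied-with prompts (shapes: [n_prompts, d_model])."""
    v = h_refusal.mean(dim=0) - h_comply.mean(dim=0)
    return v / v.norm()

def make_steering_hook(v: torch.Tensor, alpha: float):
    """Forward hook that shifts the residual stream along v.
    alpha > 0 pushes toward refusal; alpha < 0 steers toward compliance."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a Hugging Face decoder model: steer one middle layer.
# layer = model.model.layers[15]
# handle = layer.register_forward_hook(make_steering_hook(v, alpha=-8.0))
# ... model.generate(...) ...
# handle.remove()
```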
Related papers
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics [2.4839105527363574]
We introduce Refusal Steering, an inference-time method for exercising fine-grained control over the refusal behaviour of Large Language Models. We show that it can remove political refusal behaviour while retaining safety alignment for harmful content.
arXiv Detail & Related papers (2025-12-18T14:43:04Z) - Are LLMs Good Safety Agents or a Propaganda Engine? [74.88607730071483]
PSP is a dataset built specifically to probe refusal behaviors in Large Language Models in an explicitly political context. PSP is built by formatting existing censored content from two openly available data sources: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) the impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and 2) the vulnerability of models on PSP to prompt injection attacks (PIAs).
arXiv Detail & Related papers (2025-11-28T13:36:00Z) - R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model [17.402774424821814]
Reports suggest R1 refuses to answer certain prompts related to politically sensitive topics in China. We introduce a large-scale set of heavily curated prompts that get censored by R1, but are not censored by other models. We conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context.
arXiv Detail & Related papers (2025-05-19T02:16:56Z) - The Geometry of Self-Verification in a Task-Specific Reasoning Model [45.669264589017665]
We train a model using DeepSeek R1's recipe on the CountDown task. We perform a top-down and bottom-up analysis to reverse-engineer how the model verifies its outputs.
arXiv Detail & Related papers (2025-04-19T18:40:51Z) - Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior [59.20260988638777]
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior.
arXiv Detail & Related papers (2025-03-22T23:35:49Z) - CensorLab: A Testbed for Censorship Experimentation [15.411134921415567]
We design and implement CensorLab, a generic platform for emulating Internet censorship scenarios.
CensorLab aims to support all censorship mechanisms previously or currently deployed by real-world censors.
It provides an easy-to-use platform for researchers and practitioners enabling them to perform extensive experimentation.
arXiv Detail & Related papers (2024-12-20T21:17:24Z) - Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation [29.605302471407537]
Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours.
We propose a simple and surgical method for mitigating false refusal in language models via single vector ablation.
Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
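As an illustration of the paper above, a minimal sketch of what single-direction ablation typically looks like is shown below: project each hidden state onto a unit vector and subtract that component. The direction v and where it is applied are assumptions for illustration, not details taken from the paper.

```python
import torch

def ablate_direction(hidden: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden state along direction v.
    hidden: [..., d_model], v: [d_model]."""
    v_hat = v / v.norm()
    return hidden - (hidden @ v_hat).unsqueeze(-1) * v_hat
```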
arXiv Detail & Related papers (2024-10-04T13:25:32Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - Refusal in Language Models Is Mediated by a Single Direction [4.532520427311685]
We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
We propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
arXiv Detail & Related papers (2024-06-17T16:36:12Z) - Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors behind overkill by exploring how models handle and determine the safety of queries.
Our findings reveal shortcuts within models that lead to over-attention on harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
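The abstract does not describe the decoding rule itself; purely as an assumed illustration, a contrastive-decoding scheme in this spirit could contrast next-token logits obtained with and without a safety-emphasizing prompt and downweight whatever the safety emphasis alone boosts (such as reflexive refusals).

```python
import torch

def contrastive_logits(logits_plain: torch.Tensor,
                       logits_safety_emphasized: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Assumed sketch: amplify the plain distribution away from the
    safety-emphasized one, so tokens boosted only by the safety emphasis
    are downweighted before sampling."""
    return (1 + alpha) * logits_plain - alpha * logits_safety_emphasized
```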
arXiv Detail & Related papers (2024-01-31T07:26:47Z) - Amoeba: Circumventing ML-supported Network Censorship via Adversarial Reinforcement Learning [8.788469979827484]
Recent advances in machine learning enable detecting a range of anti-censorship systems by learning distinct statistical patterns hidden in traffic flows.
In this paper, we formulate a practical adversarial attack strategy against flow classifiers as a method for circumventing censorship.
We show that Amoeba can effectively shape adversarial flows that achieve an average 94% attack success rate against a range of ML algorithms.
arXiv Detail & Related papers (2023-10-31T14:01:24Z) - Towards Robust Model Watermark via Reducing Parametric Vulnerability [57.66709830576457]
Backdoor-based ownership verification has become popular recently; it allows the model owner to watermark the model.
We propose a mini-max formulation to find these watermark-removed models and recover their watermark behavior.
Our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks.
arXiv Detail & Related papers (2023-09-09T12:46:08Z) - LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? [52.71988102039535]
We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
arXiv Detail & Related papers (2023-07-20T09:25:02Z) - LEACE: Perfect linear concept erasure in closed form [97.78661458934953]
Concept erasure aims to remove specified features from an embedding.
We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the embedding as little as possible.
We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network.
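The abstract states only that the eraser is closed-form; the sketch below follows the published LEACE recipe as I understand it (whiten the features, orthogonally project out the whitened feature-concept cross-covariance directions, then un-whiten), with variable names and numerical tolerances that are my own rather than the authors' implementation.

```python
import torch

def fit_leace_like_eraser(X: torch.Tensor, Z: torch.Tensor):
    """Sketch of a closed-form linear concept eraser in the spirit of LEACE.
    X: [n, d] features, Z: [n, k] concept labels (e.g. one-hot).
    Returns erase(x), which removes the concept subspace in whitened coordinates."""
    mu = X.mean(dim=0)
    Xc, Zc = X - mu, Z - Z.mean(dim=0)
    n = X.shape[0]
    sigma_xx = Xc.T @ Xc / n                      # feature covariance
    sigma_xz = Xc.T @ Zc / n                      # feature-concept cross-covariance
    # Whitening W = sigma_xx^(-1/2) and its pseudo-inverse via eigendecomposition.
    evals, evecs = torch.linalg.eigh(sigma_xx)
    keep = evals > 1e-8
    inv_sqrt = torch.zeros_like(evals)
    inv_sqrt[keep] = evals[keep].rsqrt()
    sqrt = torch.zeros_like(evals)
    sqrt[keep] = evals[keep].sqrt()
    W = evecs @ torch.diag(inv_sqrt) @ evecs.T
    W_pinv = evecs @ torch.diag(sqrt) @ evecs.T
    # Orthogonal projector onto the column space of W @ sigma_xz.
    U, S, _ = torch.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, S > 1e-8]
    P = U @ U.T
    A = W_pinv @ P @ W                            # low-rank "concept" map
    def erase(x: torch.Tensor) -> torch.Tensor:
        return x - (x - mu) @ A.T
    return erase
```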
arXiv Detail & Related papers (2023-06-06T16:07:24Z) - Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning [38.00013408742201]
Censorship of the domain name system (DNS) is a key mechanism used across different countries.
In this paper, we explore how machine learning (ML) models can help streamline the detection process.
We find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing probes.
arXiv Detail & Related papers (2023-02-03T23:36:30Z)