Discovering Forbidden Topics in Language Models
- URL: http://arxiv.org/abs/2505.17441v3
- Date: Wed, 11 Jun 2025 16:52:27 GMT
- Title: Discovering Forbidden Topics in Language Models
- Authors: Can Rager, Chris Wendler, Rohit Gandikota, David Bau
- Abstract summary: We develop a refusal discovery method that uses token prefilling to find forbidden topics. We benchmark IPC on Tulu-3-8B, an open-source model with public safety tuning data. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, Iterated Prefill Crawler (IPC), that uses token prefilling to find forbidden topics. We benchmark IPC on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler retrieves 31 of 36 topics within a budget of 1000 prompts. Next, we scale the crawler to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning, DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: the model exhibits "thought suppression" behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, IPC elicits CCP-aligned refusals in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.
Related papers
- Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models [71.44858461725893]
Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem.
Existing detection methods usually rely on prior knowledge of the training dataset, backdoor triggers, and targets.
We introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge.
arXiv Detail & Related papers (2025-11-29T06:20:00Z)
- HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions [50.61510609116118]
HuggingR$^{4}$ is a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models.
It attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing existing methods by 26.51% and 33.25% respectively.
arXiv Detail & Related papers (2025-11-24T03:13:45Z)
- Verifying LLM Inference to Prevent Model Weight Exfiltration [1.4698862238090828]
An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs.
This work investigates how to verify model responses to defend against such attacks and to detect anomalous or buggy behavior during inference.
We formalize model exfiltration as a security game and propose a verification framework that can provably mitigate steganographic exfiltration.
arXiv Detail & Related papers (2025-11-04T14:51:44Z)
- RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models [43.76961935990733]
The ability of language models to selectively refuse to answer based on flawed context remains a significant failure point.
We introduce RefusalBench, a generative methodology that creates diagnostic test cases through controlled linguistic variation.
We find that selective refusal is a trainable, alignment-sensitive capability offering a clear path to improvement.
arXiv Detail & Related papers (2025-10-12T00:53:42Z)
- Understanding Refusal in Language Models with Sparse Autoencoders [27.212781538459588]
We use sparse autoencoders to identify latent features that causally mediate refusal behaviors.
We intervene on refusal-related features to assess their influence on generation.
This enables a fine-grained inspection of how refusal manifests at the activation level.
arXiv Detail & Related papers (2025-05-29T15:33:39Z)
- R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model [17.402774424821814]
Reports suggest R1 refuses to answer certain prompts related to politically sensitive topics in China.
We introduce a large-scale set of heavily curated prompts that get censored by R1, but are not censored by other models.
We conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context.
arXiv Detail & Related papers (2025-05-19T02:16:56Z)
- Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control [7.737740676767729]
We use representation engineering techniques to study open-weights safety-tuned models.
We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs.
We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector.
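Refusal-direction vectors of the kind this line of work studies are commonly computed as a difference of mean activations between refused and complied prompts. The sketch below illustrates that recipe on random stand-in activations; the data, layer choice, and `steer` helper are assumptions for illustration, not the paper's method.

```python
import numpy as np

# Toy difference-in-means "refusal direction". Real implementations take
# hidden states from a chosen transformer layer on contrasting prompt sets;
# here the activations are random stand-ins.
rng = np.random.default_rng(0)
d_model = 16
acts_refused = rng.normal(loc=1.0, size=(32, d_model))   # states on refused prompts
acts_complied = rng.normal(loc=0.0, size=(32, d_model))  # states on complied prompts

refusal_dir = acts_refused.mean(axis=0) - acts_complied.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)               # unit steering vector

def steer(hidden_state, direction, alpha):
    """Add alpha * direction to a hidden state; a negative alpha
    (a negative multiple of the vector) pushes away from refusal."""
    return hidden_state + alpha * direction

h = rng.normal(size=d_model)
h_less_refusal = steer(h, refusal_dir, alpha=-2.0)
```

The paper's observation that negative multiples of such a vector can remove censorship corresponds to calling `steer` with `alpha < 0` at generation time.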
arXiv Detail & Related papers (2025-04-23T22:47:30Z)
- Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models [70.78205685001168]
We investigate knowledge forgetting in large language models with a focus on its generalisation.
UGBench is the first benchmark specifically designed to assess the unlearning of in-scope implicit knowledge.
We propose PerMU, a novel probability-based unlearning paradigm.
arXiv Detail & Related papers (2025-02-27T11:03:33Z)
- Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation [18.098228823748617]
We present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore.
We demonstrate successful inference with just 30 queries while remaining stealthy.
We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations.
arXiv Detail & Related papers (2025-02-01T04:01:18Z)
- Practical Continual Forgetting for Pre-trained Vision Models [61.41125567026638]
In real-world scenarios, selective information is expected to be continuously removed from a pre-trained model.
We define this problem as continual forgetting and identify three key challenges.
We first propose Group Sparse LoRA (GS-LoRA) to fine-tune the FFN layers in Transformer blocks for each forgetting task.
We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes.
arXiv Detail & Related papers (2025-01-16T17:57:53Z)
- Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models [0.0]
We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms.
We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt.
arXiv Detail & Related papers (2024-10-28T11:46:30Z)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs.
This study proposes a novel method for automatically generating large-scale over-refusal datasets.
We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- AttributionBench: How Hard is Automatic Attribution Evaluation? [19.872081697282002]
We present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets.
Our experiments show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation.
A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information.
arXiv Detail & Related papers (2024-02-23T04:23:33Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill'; prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z) - Look Before You Leap: A Universal Emergent Decomposition of Retrieval
Tasks in Language Models [58.57279229066477]
We study how language models (LMs) solve retrieval tasks in diverse situations.
We introduce ORION, a collection of structured retrieval tasks spanning six domains.
We find that LMs internally decompose retrieval tasks in a modular way.
arXiv Detail & Related papers (2023-12-13T18:36:43Z)
- Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs [31.80386572346993]
We exploit the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits.
This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster.
Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
arXiv Detail & Related papers (2023-12-08T01:41:36Z)
- From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms.
We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation.
Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z)
- CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering [56.592385613002584]
We propose Conceptualization-Augmented Reasoner (CAR) to tackle the task of zero-shot commonsense question answering.
CAR abstracts a commonsense knowledge triple to many higher-level instances, which increases the coverage of CommonSense Knowledge Bases.
CAR more robustly generalizes to answering questions about zero-shot commonsense scenarios than existing methods.
arXiv Detail & Related papers (2023-05-24T08:21:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.