HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
- URL: http://arxiv.org/abs/2410.02684v1
- Date: Thu, 3 Oct 2024 17:10:41 GMT
- Title: HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
- Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng
- Abstract summary: We introduce HiddenGuard, a novel framework for fine-grained, safe generation in Large Language Models.
HiddenGuard incorporates Prism, which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content.
Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content.
- Score: 42.222681564769076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies, such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content by leveraging intermediate hidden states. This fine-grained approach allows for more nuanced, context-aware moderation, enabling the model to generate informative responses while selectively redacting or replacing sensitive information, rather than outright refusal. We also contribute a comprehensive dataset with token-level fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
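As a rough illustration of the token-level idea described in the abstract, the sketch below scores each generated token's intermediate hidden state with a lightweight probe and redacts tokens whose score crosses a threshold. The linear probe, the 0.5 threshold, and the "[REDACTED]" placeholder are illustrative assumptions for this sketch (the probe here is untrained), not the actual Prism architecture or training procedure from the paper.

```python
# Minimal sketch of token-level moderation over hidden states.
# The probe design, threshold, and redaction token are assumptions for
# illustration only, not the implementation described in the paper.
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    """Lightweight probe that scores each token's hidden state for harmfulness."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len) scores in [0, 1]
        return torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)


def redact(tokens: list[str], scores: torch.Tensor, threshold: float = 0.5,
           mask: str = "[REDACTED]") -> list[str]:
    """Replace tokens whose harmfulness score exceeds the threshold."""
    return [mask if s > threshold else t for t, s in zip(tokens, scores.tolist())]


if __name__ == "__main__":
    hidden_size = 16
    tokens = ["Take", "two", "tablets", "of", "X", "daily"]
    router = TokenRouter(hidden_size)
    fake_hidden = torch.randn(1, len(tokens), hidden_size)  # stand-in for LLM hidden states
    scores = router(fake_hidden)[0]
    print(redact(tokens, scores))
```

In practice such a router would be trained on token-level annotations like those contributed by the paper and would read hidden states from intermediate layers of the deployed LLM rather than from random tensors.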
Related papers
- VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data [29.806775884883685]
VLMGuard is a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection.
We present an automated maliciousness estimation score for distinguishing between benign and malicious samples.
Our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications.
arXiv Detail & Related papers (2024-10-01T00:37:29Z)
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- Protecting Your LLMs with Information Bottleneck [20.870610473199125]
We introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle.
The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor.
Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts.
arXiv Detail & Related papers (2024-04-22T08:16:07Z)
- Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content [33.99403318079253]
Even safe text coming from large language models can be turned into potentially dangerous content through Bait-and-Switch attacks.
The alarming efficacy of this approach highlights a significant challenge in developing reliable safety guardrails for LLMs.
arXiv Detail & Related papers (2024-02-21T16:46:36Z)
- On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
- A Survey on Detection of LLMs-Generated Content [97.87912800179531]
The ability to detect LLM-generated content has become of paramount importance.
We aim to provide a detailed overview of existing detection strategies and benchmarks.
We also posit the necessity for a multi-faceted approach to defend against various attacks.
arXiv Detail & Related papers (2023-10-24T09:10:26Z)
- Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks [73.53327403684676]
We propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights.
We study direct edits to model weights because this approach should guarantee that particular deleted information is never extracted by future prompt attacks.
We show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time.
arXiv Detail & Related papers (2023-09-29T17:12:43Z)
- Knowledge Sanitization of Large Language Models [4.722882736419499]
Large language models (LLMs) trained on a large corpus of Web data can potentially reveal sensitive or confidential information.
Our technique efficiently fine-tunes these models using the Low-Rank Adaptation (LoRA) method.
Experimental results in a closed-book question-answering task show that our straightforward method not only minimizes particular knowledge leakage but also preserves the overall performance of LLMs.
arXiv Detail & Related papers (2023-09-21T07:49:55Z)
- LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? [52.71988102039535]
We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
arXiv Detail & Related papers (2023-07-20T09:25:02Z)