HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
- URL: http://arxiv.org/abs/2410.02684v1
- Date: Thu, 3 Oct 2024 17:10:41 GMT
- Title: HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
- Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng
- Abstract summary: We introduce HiddenGuard, a novel framework for fine-grained, safe generation in Large Language Models.
HiddenGuard incorporates Prism, which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content.
Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content.
- Score: 42.222681564769076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies, such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content by leveraging intermediate hidden states. This fine-grained approach allows for more nuanced, context-aware moderation, enabling the model to generate informative responses while selectively redacting or replacing sensitive information, rather than outright refusal. We also contribute a comprehensive dataset with token-level fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
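As a rough illustration of the token-level idea described in the abstract, the sketch below scores each generated token's intermediate hidden state with a lightweight probe and redacts tokens whose score crosses a threshold. The linear probe, the 0.5 threshold, and the "[REDACTED]" placeholder are illustrative assumptions for this sketch (the probe here is untrained), not the actual Prism architecture or training procedure from the paper.

```python
# Minimal sketch of token-level moderation over hidden states.
# The probe design, threshold, and redaction token are assumptions for
# illustration only, not the implementation described in the paper.
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    """Lightweight probe that scores each token's hidden state for harmfulness."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len) scores in [0, 1]
        return torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)


def redact(tokens: list[str], scores: torch.Tensor, threshold: float = 0.5,
           mask: str = "[REDACTED]") -> list[str]:
    """Replace tokens whose harmfulness score exceeds the threshold."""
    return [mask if s > threshold else t for t, s in zip(tokens, scores.tolist())]


if __name__ == "__main__":
    hidden_size = 16
    tokens = ["Take", "two", "tablets", "of", "X", "daily"]
    router = TokenRouter(hidden_size)
    fake_hidden = torch.randn(1, len(tokens), hidden_size)  # stand-in for LLM hidden states
    scores = router(fake_hidden)[0]
    print(redact(tokens, scores))
```

In practice such a router would be trained on token-level annotations like those contributed by the paper and would read hidden states from intermediate layers of the deployed LLM rather than from random tensors.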
Related papers
- VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data [29.806775884883685]
VLMGuard is a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection.
We present an automated maliciousness estimation score for distinguishing between benign and malicious samples.
Our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications.
arXiv Detail & Related papers (2024-10-01T00:37:29Z)
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- Protecting Your LLMs with Information Bottleneck [20.870610473199125]
We introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle.
The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor.
Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts.
arXiv Detail & Related papers (2024-04-22T08:16:07Z)
- Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content [33.99403318079253]
Even safe text coming from large language models can be turned into potentially dangerous content through Bait-and-Switch attacks.
The alarming efficacy of this approach highlights a significant challenge in developing reliable safety guardrails for LLMs.
arXiv Detail & Related papers (2024-02-21T16:46:36Z)
- On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
- A Survey on Detection of LLMs-Generated Content [97.87912800179531]
The ability to detect LLM-generated content has become of paramount importance.
We aim to provide a detailed overview of existing detection strategies and benchmarks.
We also posit the necessity for a multi-faceted approach to defend against various attacks.
arXiv Detail & Related papers (2023-10-24T09:10:26Z)
- Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks [73.53327403684676]
We propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights.
We study direct edits to model weights because this approach should guarantee that particular deleted information is never extracted by future prompt attacks.
We show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time.
arXiv Detail & Related papers (2023-09-29T17:12:43Z)
- Knowledge Sanitization of Large Language Models [4.722882736419499]
Large language models (LLMs) trained on a large corpus of Web data can potentially reveal sensitive or confidential information.
Our technique efficiently fine-tunes these models using the Low-Rank Adaptation (LoRA) method.
Experimental results in a closed-book question-answering task show that our straightforward method not only minimizes particular knowledge leakage but also preserves the overall performance of LLMs.
arXiv Detail & Related papers (2023-09-21T07:49:55Z)
- LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? [52.71988102039535]
We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
arXiv Detail & Related papers (2023-07-20T09:25:02Z)