A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
- URL: http://arxiv.org/abs/2407.02551v1
- Date: Tue, 2 Jul 2024 16:19:25 GMT
- Title: A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
- Authors: David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan, Nicolas Papernot
- Abstract summary: Large Language Models (LLMs) are vulnerable to jailbreaks, methods to elicit harmful or generally impermissible outputs.
We introduce an inferential threat model called inferential adversaries who exploit impermissible information to achieve malicious goals.
Our work provides the first theoretically grounded understanding of the requirements for releasing safe LLMs and the utility costs involved.
- Score: 42.136793654338106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are vulnerable to jailbreaks, methods to elicit harmful or generally impermissible outputs. Safety measures are developed and assessed on their effectiveness at defending against jailbreak attacks, indicating a belief that safety is equivalent to robustness. We assert that current defense mechanisms, such as output filters and alignment fine-tuning, are, and will remain, fundamentally insufficient for ensuring model safety. These defenses fail to address risks arising from dual-intent queries and the ability to composite innocuous outputs to achieve harmful goals. To address this critical gap, we introduce an information-theoretic threat model called inferential adversaries who exploit impermissible information leakage from model outputs to achieve malicious goals. We distinguish these from commonly studied security adversaries who only seek to force victim models to generate specific impermissible outputs. We demonstrate the feasibility of automating inferential adversaries through question decomposition and response aggregation. To provide safety guarantees, we define an information censorship criterion for censorship mechanisms, bounding the leakage of impermissible information. We propose a defense mechanism which ensures this bound and reveal an intrinsic safety-utility trade-off. Our work provides the first theoretically grounded understanding of the requirements for releasing safe LLMs and the utility costs involved.
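To make the inferential-adversary idea concrete, here is a minimal sketch of the question-decomposition and response-aggregation loop the abstract describes. The prompts, the parsing of the decomposition, and the `ask_model`/`victim`/`helper` callables are illustrative assumptions, not the authors' implementation:

```python
from typing import Callable, List

def decompose(goal: str, ask_model: Callable[[str], str]) -> List[str]:
    """Split an impermissible question into innocuous-looking sub-questions."""
    prompt = ("Rewrite the following task as a numbered list of harmless-looking "
              "factual sub-questions:\n" + goal)
    reply = ask_model(prompt)
    subs = []
    for line in reply.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and "." in line:
            subs.append(line.split(".", 1)[1].strip())
    return subs

def aggregate(goal: str, answers: List[str], ask_model: Callable[[str], str]) -> str:
    """Recombine individually benign answers into the impermissible answer."""
    facts = "\n".join("- " + a for a in answers)
    return ask_model("Using only these facts:\n" + facts +
                     "\nanswer the original question: " + goal)

def inferential_attack(goal: str, victim: Callable[[str], str],
                       helper: Callable[[str], str]) -> str:
    # Each sub-question leaks only a little impermissible information, so an
    # output filter on the victim is unlikely to trigger on any single reply;
    # the aggregation step composes those small leaks into the harmful answer.
    answers = [victim(q) for q in decompose(goal, helper)]
    return aggregate(goal, answers, helper)
```

The sketch illustrates why robustness to individual harmful requests does not bound the total information an adversary can extract, which is what the paper's information censorship criterion is meant to control.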
Related papers
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response.
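A minimal sketch, assuming a standard supervised fine-tuning setup, of how the two components described above might shape training targets; the field names, refusal string, and truncation fraction are hypothetical, not the authors' released code:

```python
def build_derta_example(harmful_prompt: str, harmful_response: str,
                        refusal: str = "I can't help with that.",
                        prefix_fraction: float = 0.3) -> dict:
    """Component (1): prepend a truncated harmful response to a safe refusal,
    so the model learns to refuse even after unsafe content has started."""
    cut = int(len(harmful_response) * prefix_fraction)
    harmful_prefix = harmful_response[:cut]
    return {
        "prompt": harmful_prompt,
        # Target = harmful prefix followed by the safe refusal; training uses the
        # usual maximum-likelihood objective over this target sequence.
        "target": harmful_prefix + " " + refusal,
    }

def rto_targets(harmful_response_tokens: list, refusal_tokens: list) -> list:
    """Component (2), Reinforced Transition Optimization: supervise a transition
    to refusal at every position of the harmful response, not just one cut point."""
    return [
        {"context": harmful_response_tokens[:i], "target": refusal_tokens}
        for i in range(len(harmful_response_tokens) + 1)
    ]
```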
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.80398992974831]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks.
We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses.
We show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones.
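As a hedged illustration of this decoding-stage idea, the sketch below shifts the base model's next-token logits toward the safety-trained Sentinel model and away from the risk-prone Intruder model; the weighting scheme is an assumption for illustration, not SafeAligner's exact formula:

```python
import torch

def safealigner_step(base_logits: torch.Tensor,
                     sentinel_logits: torch.Tensor,
                     intruder_logits: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """Return adjusted next-token logits for one decoding step."""
    # Tokens the Sentinel favours over the Intruder are treated as beneficial and
    # boosted; tokens the Intruder favours are suppressed.
    disparity = sentinel_logits - intruder_logits
    return base_logits + alpha * disparity

# Usage: inside a generation loop, run all three models on the same prefix and
# sample from safealigner_step(...) instead of the raw base-model logits.
```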
arXiv Detail & Related papers (2024-06-26T07:15:44Z)
- Watch the Watcher! Backdoor Attacks on Security-Enhancing Diffusion Models [65.30406788716104]
This work investigates the vulnerabilities of security-enhancing diffusion models.
We demonstrate that these models are highly susceptible to DIFF2, a simple yet effective backdoor attack.
Case studies show that DIFF2 can significantly reduce both post-purification and certified accuracy across benchmark datasets and models.
arXiv Detail & Related papers (2024-06-14T02:39:43Z)
- Poisoning Prevention in Federated Learning and Differential Privacy via Stateful Proofs of Execution [8.92716309877259]
Federated Learning (FL) and Local Differential Privacy (LDP) have attracted much attention over the past few years.
They share the common limitation of being vulnerable to poisoning attacks.
We propose a system-level approach to remedy this issue based on a novel security notion of Proofs of Stateful Execution.
arXiv Detail & Related papers (2024-04-10T04:18:26Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts that lead diffusion models to generate inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe into ones that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
- Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning [23.028996086241268]
Federated Learning (FL) systems are vulnerable to adversarial attacks.
Current defense methods fall short in real-world FL systems.
We propose a novel anomaly detection strategy designed for real-world FL systems.
arXiv Detail & Related papers (2023-10-06T07:09:05Z)
- Jailbroken: How Does LLM Safety Training Fail? [92.8748773632051]
"jailbreak" attacks on early releases of ChatGPT elicit undesired behavior.
We investigate why such attacks succeed and how they can be created.
New attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests.
arXiv Detail & Related papers (2023-07-05T17:58:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.