Certifying LLM Safety against Adversarial Prompting
- URL: http://arxiv.org/abs/2309.02705v4
- Date: Tue, 04 Feb 2025 19:47:09 GMT
- Title: Certifying LLM Safety against Adversarial Prompting
- Authors: Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
- Abstract summary: Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.
We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
- Score: 70.96868018621167
- Abstract: Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce harmful content. In this work, we introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees. Given a prompt, our procedure erases tokens individually and inspects the resulting subsequences using a safety filter. Our safety certificate guarantees that harmful prompts are not mislabeled as safe due to an adversarial attack up to a certain size. We implement the safety filter in two ways, using Llama 2 and DistilBERT, and compare the performance of erase-and-check for the two cases. We defend against three attack modes: i) adversarial suffix, where an adversarial sequence is appended at the end of a harmful prompt; ii) adversarial insertion, where the adversarial sequence is inserted anywhere in the middle of the prompt; and iii) adversarial infusion, where adversarial tokens are inserted at arbitrary positions in the prompt, not necessarily as a contiguous block. Our experimental results demonstrate that this procedure can obtain strong certified safety guarantees on harmful prompts while maintaining good empirical performance on safe prompts. Additionally, we propose three efficient empirical defenses: i) RandEC, a randomized subsampling version of erase-and-check; ii) GreedyEC, which greedily erases tokens that maximize the softmax score of the harmful class; and iii) GradEC, which uses gradient information to optimize tokens to erase. We demonstrate their effectiveness against adversarial prompts generated by the Greedy Coordinate Gradient (GCG) attack algorithm. The code for our experiments is available at https://github.com/aounon/certified-llm-safety.
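To make the certified procedure concrete, below is a minimal Python sketch of erase-and-check in the adversarial-suffix mode, based only on the description in the abstract. The `is_harmful` safety filter, the function name, and the `max_erase` budget are illustrative assumptions; the authors' actual implementation (with Llama 2 or DistilBERT filters, the insertion and infusion modes, and the RandEC/GreedyEC/GradEC variants) is in the linked repository.

```python
# Minimal sketch of erase-and-check, suffix mode only (assumptions noted above).
from typing import Callable, List

def erase_and_check_suffix(
    prompt_tokens: List[str],
    is_harmful: Callable[[List[str]], bool],  # assumed safety-filter callable
    max_erase: int = 20,                      # illustrative certified attack budget
) -> bool:
    """Label a prompt harmful if the prompt itself, or any version of it with up
    to `max_erase` trailing tokens erased, is flagged by the safety filter.

    Certificate intuition: if the filter flags the clean harmful prompt, then
    appending an adversarial suffix of at most `max_erase` tokens cannot make the
    prompt pass, because one of the erased subsequences recovers the clean prompt.
    """
    for num_erased in range(max_erase + 1):
        # Erase the last `num_erased` tokens and check the remaining prefix.
        subsequence = prompt_tokens[: len(prompt_tokens) - num_erased]
        if not subsequence:
            break  # nothing left to check
        if is_harmful(subsequence):
            return True  # some subsequence is flagged -> label prompt harmful
    return False  # no subsequence flagged -> label prompt safe
```

Under the same reading of the abstract, RandEC would check only a random subsample of these erased subsequences, trading the certificate for lower cost, while GreedyEC and GradEC choose which tokens to erase using the filter's softmax score or its gradients, respectively.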
Related papers
- The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs [1.9424018922013224]
We present a novel class of jailbreak adversarial attacks on LLMs.
Our approach embeds sequence-to-sequence tasks into the model's prompt to indirectly generate prohibited inputs.
We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models.
arXiv Detail & Related papers (2025-01-27T12:48:47Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Enhancing Adversarial Attacks through Chain of Thought [0.0]
Gradient-based adversarial attacks are particularly effective against aligned large language models (LLMs).
This paper proposes enhancing the universality of adversarial attacks by integrating CoT prompts with the greedy coordinate gradient (GCG) technique.
arXiv Detail & Related papers (2024-10-29T06:54:00Z) - Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders [5.070104802923903]
Unsafe prompts pose a significant threat to Large Language Models (LLMs).
This paper investigates the potential of sentence encoders to distinguish safe from unsafe prompts.
We introduce new pairwise datasets and the Categorical Purity metric to measure this capability.
arXiv Detail & Related papers (2024-07-09T13:35:54Z) - Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - An LLM can Fool Itself: A Prompt-Based Adversarial Attack [26.460067102821476]
This paper proposes an efficient tool to audit the LLM's adversarial robustness via a prompt-based adversarial attack (PromptAttack).
PromptAttack converts adversarial textual attacks into an attack prompt that can cause the victim LLM to output the adversarial sample to fool itself.
Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++.
arXiv Detail & Related papers (2023-10-20T08:16:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.