Related papers: Single-pass Detection of Jailbreaking Input in Large Language Models

Single-pass Detection of Jailbreaking Input in Large Language Models

URL: http://arxiv.org/abs/2502.15435v1
Date: Fri, 21 Feb 2025 13:04:13 GMT
Title: Single-pass Detection of Jailbreaking Input in Large Language Models
Authors: Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher,
Abstract summary: Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem.<n>We focus on detecting jailbreaking input in a single forward pass.<n>Our method, called Single Pass Detection SPD, leverages the information carried by the logits to predict whether the output sentence will be harmful.
Score: 48.384044012457984
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection SPD, leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD can not only detect attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.

Related papers

Bag of Tricks for Subverting Reasoning-based Safety Guardrails [62.139297207938036]
We present a bag of jailbreak methods that subvert the reasoning-based guardrails.<n>Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization.
arXiv Detail & Related papers (2025-10-13T16:16:44Z)
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content. adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols. We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z)
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.<n>Our threat model checks if a given jailbreak is likely to occur in the distribution of text.<n>We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
Denial-of-Service Poisoning Attacks against Large Language Models [64.77355353440691]
LLMs are vulnerable to denial-of-service (DoS) attacks, where spelling errors or non-semantic prompts trigger endless outputs without generating an [EOS] token. We propose poisoning-based DoS attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit.
arXiv Detail & Related papers (2024-10-14T17:39:31Z)
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill jailbreak attack approach for executing jailbreak attacks on Large Language Models (LLMs) Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content. Our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches.
arXiv Detail & Related papers (2024-09-11T00:00:58Z)
Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks [27.11523234556414]
We propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG) PG guides the model to identify harmful prompts by directly setting the first few tokens of the model's output. We demonstrate the effectiveness of PG across three models and five attack methods.
arXiv Detail & Related papers (2024-08-15T14:51:32Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text. Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
Fight Back Against Jailbreaking via Prompt Adversarial Tuning [23.55544992740663]
Large Language Models (LLMs) are susceptible to jailbreaking attacks. We propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix. Our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0%.
arXiv Detail & Related papers (2024-02-09T09:09:39Z)
Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.