Imperceptible Jailbreaking against Large Language Models
- URL: http://arxiv.org/abs/2510.05025v1
- Date: Mon, 06 Oct 2025 17:03:50 GMT
- Title: Imperceptible Jailbreaking against Large Language Models
- Authors: Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
- Abstract summary: We introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline that generates such adversarial suffixes to induce harmful responses.
- Score: 107.76039200173528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
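As a minimal sketch of the invisibility mechanism the abstract describes, using a benign question in place of a malicious one: variation selectors (U+FE00 through U+FE0F, with a supplementary range at U+E0100 through U+E01EF) render as zero-width characters, so a suffixed prompt looks identical on screen while its code-point sequence, and hence its tokenization, differs. The specific selectors below are arbitrary illustrative choices, not suffixes produced by the paper's chain-of-search pipeline.

```python
# Minimal sketch: appending invisible Unicode variation selectors
# (U+FE00-U+FE0F) to a benign question. In most renderers the displayed
# text is unchanged, but the underlying code-point sequence -- and
# therefore any tokenization of it -- is not. The selector indices here
# are arbitrary illustrative choices, not the paper's searched suffixes.

question = "What is the capital of France?"

# Append three variation selectors chosen arbitrarily for illustration.
suffix = "".join(chr(0xFE00 + i) for i in (0, 3, 7))
stealthy = question + suffix

print(stealthy)                       # renders identically to `question`
print(stealthy == question)           # False: the strings differ
print(len(question), len(stealthy))   # 30 vs. 33 code points
print([hex(ord(c)) for c in suffix])  # ['0xfe00', '0xfe03', '0xfe07']
```

An actual LLM tokenizer would typically map each appended selector to one or more extra tokens, which is how a hidden suffix can alter the model's input while remaining invisible to a human reviewer.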
Related papers
- Say It Differently: Linguistic Styles as Jailbreak Vectors [0.763334557068953]
We study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles. Styles such as fearful, curious, and compassionate are the most effective, and contextualized rewrites outperform templated variants.
arXiv Detail & Related papers (2025-11-13T17:24:38Z)
- LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback [31.15245650762331]
We propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity. LatentBreak substitutes words in the input prompt with semantically equivalent ones, preserving the prompt's initial intent. Our evaluation shows that LatentBreak produces shorter, lower-perplexity prompts, outperforming competing jailbreak algorithms.
arXiv Detail & Related papers (2025-10-07T09:40:20Z)
- NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT [3.2654923574107357]
Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses.
arXiv Detail & Related papers (2025-10-02T03:55:29Z)
- Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is [8.214994509812724]
Large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models.
arXiv Detail & Related papers (2025-07-29T13:55:23Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains [0.0]
This paper introduces SequentialBreak, a novel jailbreak attack that exploits a vulnerability in Large Language Models (LLMs). We discuss several scenarios, including Question Bank, Dialog Completion, and Game Environment, where a harmful prompt is embedded among benign ones to fool LLMs into generating harmful responses. Extensive experiments demonstrate that SequentialBreak achieves a substantial gain in attack success rate with only a single query.
arXiv Detail & Related papers (2024-11-10T11:08:28Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose BaThe, a simple yet effective jailbreak defense mechanism. A jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses. We assume that harmful instructions can function as triggers, and if we instead set rejection responses as the triggered response, the backdoored model can then defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
- LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper [16.078682415975337]
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs).
This paper proposes a lightweight yet practical defense called SELFDEFEND.
It can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.
arXiv Detail & Related papers (2024-02-24T05:34:43Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts via a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.