Related papers: Machine Learning for Detection and Analysis of Novel LLM Jailbreaks

Machine Learning for Detection and Analysis of Novel LLM Jailbreaks

URL: http://arxiv.org/abs/2510.01644v2
Date: Fri, 10 Oct 2025 05:08:44 GMT
Title: Machine Learning for Detection and Analysis of Novel LLM Jailbreaks
Authors: John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra,
Abstract summary: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text.<n>These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies.<n>In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses.
Score: 3.2654923574107357
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.

Related papers

Imperceptible Jailbreaking against Large Language Models [107.76039200173528]
We introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors.<n>By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen.<n>We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses.
arXiv Detail & Related papers (2025-10-06T17:03:50Z)
Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning [48.100552417137656]
PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions.<n>We conduct extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
arXiv Detail & Related papers (2025-09-28T01:38:00Z)
LLM Jailbreak Detection for (Almost) Free! [62.466970731998714]
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks.<n>Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences.<n>We propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts.
arXiv Detail & Related papers (2025-09-18T02:42:52Z)
Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is [8.214994509812724]
Large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks.<n>This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms.<n>We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models.
arXiv Detail & Related papers (2025-07-29T13:55:23Z)
Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.49886313949869]
We propose a transferable black-box jailbreak method to attack Large Language Models (LLMs)<n>We find that this rewriting approach is learnable and transferable.<n> Extensive experiments and analysis demonstrate the effectiveness of R2J.
arXiv Detail & Related papers (2025-02-16T11:43:39Z)
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit [12.392258585661446]
Large language models (LLMs) are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses.<n>This paper proposes JailbreakLens, an interpretation framework that analyzes jailbreak mechanisms from both representation and circuit perspectives.
arXiv Detail & Related papers (2024-11-17T16:08:34Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance. It scores the harmfulness of a victim model's responses to forbidden prompts. It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.