Related papers: Mitigating Jailbreaks with Intent-Aware LLMs

URL: http://arxiv.org/abs/2508.12072v2
Date: Sat, 23 Aug 2025 06:11:04 GMT
Title: Mitigating Jailbreaks with Intent-Aware LLMs
Authors: Wei Jie Yeo, Ranjan Satapathy, Erik Cambria
Abstract summary: Large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions. We propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding. Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding a 50% success rate.
Score: 42.48292327349576
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. In this work, we propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding. By fine-tuning on a targeted set of adversarial instructions, Intent-FT enables LLMs to generalize intent deduction to unseen attacks, thereby substantially improving their robustness. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding a 50% success rate, whereas existing defenses remain only partially effective. Importantly, our method preserves the model's general capabilities and reduces excessive refusals on benign instructions containing superficially harmful keywords. Furthermore, models trained with Intent-FT accurately identify hidden harmful intent in adversarial attacks, and these learned intentions can be effectively transferred to enhance vanilla model defenses. We publicly release our code at https://github.com/wj210/Intent_Jailbreak.

Related papers

ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses [8.57098009274006]
In-context learning (ICL) has become a powerful, data-efficient paradigm for text classification using large language models. We introduce ICL-Evader, a novel black-box evasion attack framework that operates under a highly practical zero-query threat model.
arXiv Detail & Related papers (2026-01-29T11:50:50Z)
Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack [53.34204977366491]
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks. Our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts.
arXiv Detail & Related papers (2025-11-01T13:44:42Z)
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack [22.48980625853356]
Large language models (LLMs) exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability.
arXiv Detail & Related papers (2025-09-30T06:33:52Z)
PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking [3.718606661938873]
We propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs.
arXiv Detail & Related papers (2025-07-29T07:13:56Z)
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [64.32925552574115]
ARMOR is a large language model that analyzes jailbreak strategies and extracts the core intent. ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification [18.006622965818856]
We introduce DETAM, a finetuning-free defense approach that improves the defensive capabilities against jailbreak attacks of LLMs. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize the user's core intention, minimizing interference from attack tokens.
arXiv Detail & Related papers (2025-04-18T09:02:12Z)
LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution [84.2846064139183]
Large Language Models (LLMs) face threats from jailbreak prompts. We propose LightDefense, a lightweight defense mechanism targeted at white-box models.
arXiv Detail & Related papers (2025-04-02T09:21:26Z)
Understanding and Enhancing the Transferability of Jailbreaking Attacks [12.446931518819875]
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. This work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. We propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input.
arXiv Detail & Related papers (2025-02-05T10:29:54Z)
Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
Robust LLM safeguarding via refusal feature adversarial training [15.76605079209956]
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. We propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs adversarial training. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks.
arXiv Detail & Related papers (2024-09-30T08:41:39Z)
Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks. We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z)
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models. Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs). Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.