AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
- URL: http://arxiv.org/abs/2504.09466v1
- Date: Sun, 13 Apr 2025 07:39:17 GMT
- Title: AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
- Authors: Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu,
- Abstract summary: We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics.<n>AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD)<n>Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
- Score: 73.09848497762667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
Related papers
- Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks [16.508109544083496]
Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks.<n>Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment.<n>We propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks.
arXiv Detail & Related papers (2024-11-23T02:17:17Z) - Steering Language Model Refusal with Sparse Autoencoders [16.78963326253821]
We identify and steer features in Phi-3 Mini that mediate refusal behavior.
We find that feature steering can improve Phi-3 Minis robustness to jailbreak attempts across various harms.
However, feature steering can adversely affect overall performance on benchmarks.
arXiv Detail & Related papers (2024-11-18T05:47:02Z) - Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models [8.024771725860127]
Jailbreak attacks manipulate large language models into generating harmful content.
Jailbreak Antidote enables real-time adjustment of safety preferences by manipulating a sparse subset of the model's internal states.
Our analysis reveals that safety-related information in LLMs is sparsely distributed.
arXiv Detail & Related papers (2024-10-03T08:34:17Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes [61.916827858666906]
Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer.
To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback.
Recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails.
This paper proposes a method called Gradient Cuff to detect jailbreak attempts.
arXiv Detail & Related papers (2024-03-01T03:29:54Z) - InferAligner: Inference-Time Alignment for Harmlessness through
Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models.
Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.