AlignTree: Efficient Defense Against LLM Jailbreak Attacks
- URL: http://arxiv.org/abs/2511.12217v1
- Date: Sat, 15 Nov 2025 13:42:22 GMT
- Title: AlignTree: Efficient Defense Against LLM Jailbreak Attacks
- Authors: Gil Goren, Shahar Katz, Lior Wolf
- Abstract summary: Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. We introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead.
- Score: 48.805151796878505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction -- a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated with harmful content. Unlike previous methods, AlignTree does not require additional prompts or auxiliary guard models. Through extensive experiments, we demonstrate the efficiency and robustness of AlignTree across multiple LLMs and benchmarks.
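The abstract is concrete enough to sketch the detector's overall structure. Below is a minimal, hypothetical illustration in Python/scikit-learn, not the authors' implementation: it assumes per-token hidden activations have already been extracted from the LLM, and it estimates the refusal direction as a difference of means between activations on harmful and harmless prompts (an assumed choice; the abstract only states that the direction is a linear representation that activates on misaligned prompts). All function and variable names are illustrative.

```python
# Hypothetical sketch of an AlignTree-style monitor as described in the abstract.
# Assumes per-token hidden activations are already extracted from the LLM;
# names and the difference-of-means direction estimate are illustrative, not the authors' code.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Estimate a linear 'refusal' direction as a normalized difference of mean activations."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def fit_detector(acts: np.ndarray, labels: np.ndarray, direction: np.ndarray):
    """Train the two signal extractors and the random-forest combiner."""
    # Signal (i): projection of each activation onto the refusal direction.
    proj = acts @ direction
    # Signal (ii): non-linear SVM score over the raw activations.
    svm = SVC(kernel="rbf").fit(acts, labels)
    svm_score = svm.decision_function(acts)
    # Random forest operating on the two scalar signals.
    feats = np.stack([proj, svm_score], axis=1)
    forest = RandomForestClassifier(n_estimators=100).fit(feats, labels)
    return svm, forest

def is_misaligned(act: np.ndarray, direction: np.ndarray, svm, forest) -> bool:
    """Score a single generation-time activation vector."""
    feats = np.array([[act @ direction, svm.decision_function(act[None, :])[0]]])
    return bool(forest.predict(feats)[0])
```

At generation time, each new token's activation would be scored by the forest, and a positive prediction would trigger the defense (for example, halting or redirecting generation); how the monitor is wired into decoding is an assumption here, not a detail given in the abstract.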
Related papers
- DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern [23.834578989358423]
We introduce DualSentinel, a lightweight and unified defense framework. It can accurately and promptly detect the activation of targeted attacks alongside the Large Language Model's generation process. It is highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost).
arXiv Detail & Related papers (2026-03-02T08:02:47Z) - Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security [12.835224376066769]
We introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks. Experiments demonstrate that Q-MLLM achieves a significantly better defense success rate against both jailbreak attacks and toxic image attacks than existing approaches.
arXiv Detail & Related papers (2025-11-20T10:55:19Z) - PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization [0.0]
This paper introduces a novel framework for hardening system prompts through shield appending. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks.
arXiv Detail & Related papers (2025-11-20T10:25:45Z) - Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs [0.0]
We investigate various constrained optimization formulations that address unlearning of sensitive information and robustness to jailbreaking attacks. We find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost.
arXiv Detail & Related papers (2025-10-03T23:32:21Z) - TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree [52.44403214958304]
In this paper, we introduce TreeLoRA, a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds. Experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach.
arXiv Detail & Related papers (2025-06-12T05:25:35Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certification-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models [31.35788474507371]
Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks. We present STShield, a lightweight framework for real-time jailbroken judgement.
arXiv Detail & Related papers (2025-03-23T04:23:07Z) - Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification [17.500701903902094]
Large Language Models (LLMs) are vulnerable to jailbreak attacks, which use crafted prompts to elicit toxic responses. This paper proposes DEEPALIGN, a robust defense framework that fine-tunes LLMs to progressively detoxify generated content.
arXiv Detail & Related papers (2025-03-14T08:32:12Z) - Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)