Related papers: HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

URL: http://arxiv.org/abs/2601.04034v1
Date: Wed, 07 Jan 2026 15:47:28 GMT
Title: HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense
Authors: Siyuan Li, Xi Lin, Jun Wu, Zehao Liu, Haoyu Li, Tianjie Ju, Xiang Chen, Jianhua Li,
Abstract summary: HoneyTrap is a novel deceptive language model defense framework.<n>It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer.<n>HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state-of-the-art baselines.
Score: 30.85543511046959
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Jailbreak attacks pose significant threats to large language models (LLMs), enabling attackers to bypass safeguards. However, existing reactive defense approaches struggle to keep up with the rapidly evolving multi-turn jailbreaks, where attackers continuously deepen their attacks to exploit vulnerabilities. To address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, each performing a specialized security role and collaborating to complete a deceptive defense. To ensure a comprehensive evaluation, we introduce MTJ-Pro, a challenging multi-turn progressive jailbreak dataset that combines seven advanced jailbreak strategies designed to gradually deepen attack strategies across multi-turn attacks. Besides, we present two novel metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC), which provide more nuanced assessments of deceptive defense beyond conventional measures. Experimental results on GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMa-3.1 demonstrate that HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state-of-the-art baselines. Notably, even in a dedicated adaptive attacker setting with intensified conditions, HoneyTrap remains resilient, leveraging deceptive engagement to prolong interactions, significantly increasing the time and computational costs required for successful exploitation. Unlike simple rejection, HoneyTrap strategically wastes attacker resources without impacting benign queries, improving MSR and ARC by 118.11% and 149.16%, respectively.

Related papers

TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking [52.72486831074384]
Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to steer aligned LLMs into unsafe outputs.<n>We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model without changing the inference pipeline.<n>Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average attack success rate to below 0.01 percent and achieves an average tracing success rate of 87.9 percent.
arXiv Detail & Related papers (2026-02-06T11:43:56Z)
Proactive defense against LLM Jailbreak [28.249786308207046]
ProAct is a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes.<n>Our method consistently and significantly reduces attack success rates by up to 92%.
arXiv Detail & Related papers (2025-10-06T17:32:40Z)
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models [27.23138831535272]
We propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection.<n>RAD incorporates a database of known attack examples into Retrieval-Augmented Generation.<n>Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks.
arXiv Detail & Related papers (2025-08-22T14:13:16Z)
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses [4.706534644850809]
Two primary inference-phase threats are token-level and prompt-level jailbreaks.<n>We propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs.
arXiv Detail & Related papers (2025-06-27T07:26:33Z)
TrojanTO: Action-Level Backdoor Attacks against Trajectory Optimization Models [67.06525001375722]
TrojanTO is the first action-level backdoor attack against TO models.<n>It implants backdoor attacks across diverse tasks and attack objectives with a low attack budget.<n>TrojanTO exhibits broad applicability to DT, GDT, and DC.
arXiv Detail & Related papers (2025-06-15T11:27:49Z)
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks.<n>We propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z)
Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [50.89022445197919]
Large language models (LLMs) have exhibited outstanding performance in engaging with humans. LLMs are vulnerable to jailbreak attacks, leading to the generation of harmful responses. We propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against the advanced LLMs.
arXiv Detail & Related papers (2024-10-15T10:07:15Z)
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend.<n>We empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks.<n>To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models.
arXiv Detail & Related papers (2024-06-08T15:45:31Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks [17.22989422489567]
Large language models (LLMs) are vulnerable to adversarial attacks or jailbreaking. We propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm to create robust system-level defenses. Our results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench.
arXiv Detail & Related papers (2024-01-30T18:56:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.