Related papers: GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks

GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks

URL: http://arxiv.org/abs/2409.19521v1
Date: Sun, 29 Sep 2024 02:35:38 GMT
Title: GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks
Authors: Rongchang Li, Minjie Chen, Chang Hu, Han Chen, Wenpeng Xing, Meng Han,
Abstract summary: We introduce a novel prompt injection attack detection method, GenTel-Shield, along with a comprehensive evaluation benchmark, GenTel-Bench. To prove the effectiveness of GenTel-Shield, we evaluate it together with vanilla safety guardrails against the GenTel-Bench dataset. Empirically, GenTel-Shield can achieve state-of-the-art attack detection success rates, which reveals the critical weakness of existing safeguarding techniques.
Score: 14.558728517191044
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) like GPT-4, LLaMA, and Qwen have demonstrated remarkable success across a wide range of applications. However, these models remain inherently vulnerable to prompt injection attacks, which can bypass existing safety mechanisms, highlighting the urgent need for more robust attack detection methods and comprehensive evaluation benchmarks. To address these challenges, we introduce GenTel-Safe, a unified framework that includes a novel prompt injection attack detection method, GenTel-Shield, along with a comprehensive evaluation benchmark, GenTel-Bench, which compromises 84812 prompt injection attacks, spanning 3 major categories and 28 security scenarios. To prove the effectiveness of GenTel-Shield, we evaluate it together with vanilla safety guardrails against the GenTel-Bench dataset. Empirically, GenTel-Shield can achieve state-of-the-art attack detection success rates, which reveals the critical weakness of existing safeguarding techniques against harmful prompts. For reproducibility, we have made the code and benchmarking dataset available on the project page at https://gentellab.github.io/gentel-safe.github.io/.

Related papers

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
Securing AI Agents Against Prompt Injection Attacks [0.0]
We present a benchmark for evaluating prompt injection risks in RAG-enabled AI agents.<n>Our framework reduces successful attack rates from 73.2% to 8.7% while maintaining 94.3% of baseline task performance.
arXiv Detail & Related papers (2025-11-19T10:00:54Z)
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? [58.48689960350828]
We show that a simple, modular and model-agnostic defense operating at the agent--tool interface achieves perfect security with high utility.<n>We employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer)
arXiv Detail & Related papers (2025-10-06T18:09:02Z)
Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents [4.303444472156151]
Large Language Model (LLM)-enabled agents are rapidly emerging across a wide range of applications.<n>This work presents the first study of time-of-check to time-of-use (TOCTOU) vulnerabilities in LLM-enabled agents.<n>We introduce TOCTOU-Bench, a benchmark with 66 realistic user tasks designed to evaluate this class of vulnerabilities.
arXiv Detail & Related papers (2025-08-23T22:41:49Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks.<n>We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses.<n>Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement [0.34530027457862006]
We introduce CAPTURE, a novel context-aware benchmark assessing both attack detection and over-defense tendencies.<n>Our experiments reveal that current prompt injection guardrail models suffer from high false negatives in adversarial cases and excessive false positives in benign scenarios.<n>This new model drastically reduces both false negative and false positive rates on our context-aware datasets.
arXiv Detail & Related papers (2025-05-18T11:14:14Z)
SecReEvalBench: A Multi-turned Security Resilience Evaluation Benchmark for Large Language Models [4.788427041690547]
We present SecReEvalBench, the Security Resilience Evaluation Benchmark.<n>It defines four novel metrics: Prompt Attack Resilience Score, Prompt Attack Refusal Logic Score, Chain-Based Attack Resilience Score and Chain-Based Attack Rejection Time Score.<n>We also introduce a dataset customized for the benchmark, which incorporates both neutral and malicious prompts.
arXiv Detail & Related papers (2025-05-12T14:09:24Z)
STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models [31.35788474507371]
Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks. We present STShield, a lightweight framework for real-time jailbroken judgement.
arXiv Detail & Related papers (2025-03-23T04:23:07Z)
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks. We then make a key observation: existing defense mechanisms suffer from an textbfover-prudence problem. We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models [7.186499635424984]
Prompt injection attacks pose a critical threat to large language models (LLMs) Prompt guard models, though effective in defense, suffer from over-defense due to trigger word bias. InjecGuard is a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free.
arXiv Detail & Related papers (2024-10-30T07:39:42Z)
SPIN: Self-Supervised Prompt INjection [16.253558670549697]
adversarial and jailbreak attacks have been proposed to bypass the safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN) which can detect and reverse these various attacks on LLMs. Our system can reduce the attack success rate by up to 87.9%, while maintaining the performance on benign user requests.
arXiv Detail & Related papers (2024-10-17T05:40:54Z)
SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [47.11178028457252]
We develop SecCodePLT, a unified and comprehensive evaluation platform for code GenAIs' risks. For insecure code, we introduce a new methodology for data creation that combines experts with automatic generation. For cyberattack helpfulness, we construct samples to prompt a model to generate actual attacks, along with dynamic metrics in our environment.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.80398992974831]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. We show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones.
arXiv Detail & Related papers (2024-06-26T07:15:44Z)
Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection [17.948513691133037]
We introduce CodeBreaker, a pioneering LLM-assisted backdoor attack framework on code completion models. By integrating malicious payloads directly into the source code with minimal transformation, CodeBreaker challenges current security measures.
arXiv Detail & Related papers (2024-06-10T22:10:05Z)
Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications [8.51254190797079]
We introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks. Our novel evaluation method assesses models under both defenseless and defended scenarios. Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected.
arXiv Detail & Related papers (2024-06-10T18:57:22Z)
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks. We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems. This paper proposes a self-supervised adversarial training mechanism in the input space. It provides significant robustness against the textbfunseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.