Analysis of LLMs Against Prompt Injection and Jailbreak Attacks
- URL: http://arxiv.org/abs/2602.22242v1
- Date: Tue, 24 Feb 2026 12:32:11 GMT
- Title: Analysis of LLMs Against Prompt Injection and Jailbreak Attacks
- Authors: Piyush Jaiswal, Aaditya Pratap, Shreyansh Saraswati, Harsh Kasyap, Somanath Tripathy,
- Abstract summary: This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset.<n>We observe significant behavioural variation across models, including refusal responses and complete silent non-responsiveness triggered by internal safety mechanisms.
- Score: 7.685814179879813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are widely deployed in real-world systems. Given their broader applicability, prompt engineering has become an efficient tool for resource-scarce organizations to adopt LLMs for their own purposes. At the same time, LLMs are vulnerable to prompt-based attacks. Thus, analyzing this risk has become a critical security requirement. This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants. We observe significant behavioural variation across models, including refusal responses and complete silent non-responsiveness triggered by internal safety mechanisms. Furthermore, we evaluated several lightweight, inference-time defence mechanisms that operate as filters without any retraining or GPU-intensive fine-tuning. Although these defences mitigate straightforward attacks, they are consistently bypassed by long, reasoning-heavy prompts.
Related papers
- PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization [0.0]
This paper introduces a novel framework for hardening system prompts through shield appending.<n>We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks.<n>We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks.
arXiv Detail & Related papers (2025-11-20T10:25:45Z) - Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack [53.34204977366491]
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities.<n>In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks.<n>Our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts.
arXiv Detail & Related papers (2025-11-01T13:44:42Z) - AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks.<n>We present AutoAdv, a novel framework that automates adversarial prompt generation.<n>We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z) - Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs)<n>In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents.<n>We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z) - Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks.
We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities.
We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z) - Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents.
These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks.
This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.