A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
- URL: http://arxiv.org/abs/2601.22240v1
- Date: Thu, 29 Jan 2026 19:08:37 GMT
- Title: A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
- Authors: Pedro H. Barcha Correia, Ryan W. Achjian, Diego E. G. Caetano de Oliveira, Ygor Acacio Maria, Victor Takashi Hayashi, Marcos Lopes, Charles Christian Miers, Marcos A. Simplicio,
- Abstract summary: This work presents the first systematic literature review on prompt injection mitigation strategies.<n>It identifies studies beyond those documented in NIST's report and other academic reviews and surveys.<n>By adopting NIST's established terminology and taxonomy as a foundation, we promote consistency and enable future researchers to build upon the standardized taxonomy proposed in this work.
- Score: 0.5958527024442727
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid advancement and widespread adoption of generative artificial intelligence (GenAI) and large language models (LLMs) has been accompanied by the emergence of new security vulnerabilities and challenges, such as jailbreaking and other prompt injection attacks. These maliciously crafted inputs can exploit LLMs, causing data leaks, unauthorized actions, or compromised outputs, for instance. As both offensive and defensive prompt injection techniques evolve quickly, a structured understanding of mitigation strategies becomes increasingly important. To address that, this work presents the first systematic literature review on prompt injection mitigation strategies, comprehending 88 studies. Building upon NIST's report on adversarial machine learning, this work contributes to the field through several avenues. First, it identifies studies beyond those documented in NIST's report and other academic reviews and surveys. Second, we propose an extension to NIST taxonomy by introducing additional categories of defenses. Third, by adopting NIST's established terminology and taxonomy as a foundation, we promote consistency and enable future researchers to build upon the standardized taxonomy proposed in this work. Finally, we provide a comprehensive catalog of the reviewed prompt injection defenses, documenting their reported quantitative effectiveness across specific LLMs and attack datasets, while also indicating which solutions are open-source and model-agnostic. This catalog, together with the guidelines presented herein, aims to serve as a practical resource for researchers advancing the field of adversarial machine learning and for developers seeking to implement effective defenses in production systems.
Related papers
- The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis [24.51410516475904]
This SoK presents a comprehensive overview of the Prompt Injection (PI) landscape, covering attacks, defenses, and their evaluation practices.<n>We introduce AgentPI, a new benchmark designed to systematically evaluate agent behavior under context-dependent interaction settings.<n>We show that many defenses appear effective under existing benchmarks by suppressing contextual inputs, yet fail to generalize to realistic agent settings where context-dependent reasoning is essential.
arXiv Detail & Related papers (2026-02-11T02:47:10Z) - SoK: Large Language Model Copyright Auditing via Fingerprinting [69.14570598973195]
We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches.<n>We propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios.
arXiv Detail & Related papers (2025-08-27T12:56:57Z) - A Systematic Survey of Model Extraction Attacks and Defenses: State-of-the-Art and Perspectives [65.3369988566853]
Recent studies have demonstrated that adversaries can replicate a target model's functionality.<n>Model Extraction Attacks pose threats to intellectual property, privacy, and system security.<n>We propose a novel taxonomy that classifies MEAs according to attack mechanisms, defense approaches, and computing environments.
arXiv Detail & Related papers (2025-08-20T19:49:59Z) - Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers [61.57691030102618]
We propose a novel jailbreaking method, Paper Summary Attack (llmnamePSA)<n>It synthesizes content from either attack-focused or defense-focused LLM safety paper to construct an adversarial prompt template.<n>Experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning model like Deepseek-R1.
arXiv Detail & Related papers (2025-07-17T18:33:50Z) - A Survey on Model Extraction Attacks and Defenses for Large Language Models [55.60375624503877]
Model extraction attacks pose significant security threats to deployed language models.<n>This survey provides a comprehensive taxonomy of extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks.<n>We examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios.
arXiv Detail & Related papers (2025-06-26T22:02:01Z) - Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks.<n>We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance.<n>Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
arXiv Detail & Related papers (2025-01-30T18:02:15Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.<n>We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.<n>We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [8.345554966569479]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation.<n>Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks.<n>This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z) - Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models [32.03992137755351]
This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs)
We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively measure attack effectiveness.
arXiv Detail & Related papers (2024-07-12T14:26:14Z) - Topological safeguard for evasion attack interpreting the neural
networks' behavior [0.0]
In this work, a novel detector of evasion attacks is developed.
It focuses on the information of the activations of the neurons given by the model when an input sample is injected.
For this purpose, a huge data preprocessing is required to introduce all this information in the detector.
arXiv Detail & Related papers (2024-02-12T08:39:40Z) - Survey of Vulnerabilities in Large Language Models Revealed by
Adversarial Attacks [5.860289498416911]
Large Language Models (LLMs) are swiftly advancing in architecture and capability.
As they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows.
This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs.
arXiv Detail & Related papers (2023-10-16T21:37:24Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.