InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models
- URL: http://arxiv.org/abs/2505.18156v1
- Date: Wed, 16 Apr 2025 05:00:56 GMT
- Title: InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models
- Authors: Austin Howard,
- Abstract summary: This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate language models.<n>The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer.<n>It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt-based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate LLMs. The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML-based simulation tests. A Python tool supports easy execution of prompt-based test cases. This paper outlines the framework's structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community-driven foundation for securing language models.
Related papers
- Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities [76.9327488986162]
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images.<n>We exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction.<n>Our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B)
arXiv Detail & Related papers (2025-05-31T13:11:14Z) - Automated Red Teaming with GOAT: the Generative Offensive Agent Tester [8.947465706080523]
Red teaming assesses how large language models can produce content that violates norms, policies, and rules set during their safety training.
Most existing automated methods in the literature are not representative of the way humans tend to interact with AI models.
We introduce Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations.
arXiv Detail & Related papers (2024-10-02T14:47:05Z) - Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA) [0.0]
This paper introduces a new method for adversarial attacks on large language models (LLMs) called the Single-Turn Crescendo Attack (STCA)
Building on the multi-turn crescendo attack method introduced by Russinovich, Salem, and Eldan (2024), the STCA achieves similar outcomes in a single interaction.
arXiv Detail & Related papers (2024-09-04T23:45:10Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - Jailbreaking Text-to-Image Models with LLM-Based Agents [15.582860145268553]
We propose an advanced multi-agent framework targeting generative AI models, specifically focusing on jailbreak attacks against text-to-image (T2I) models with built-in safety filters.
Our evaluation demonstrates that Atlas successfully jailbreaks several state-of-the-art T2I models equipped with multi-modal safety filters in a black-box setting.
arXiv Detail & Related papers (2024-08-01T12:54:46Z) - MetaReflection: Learning Instructions for Language Agents using Past Reflections [11.028256182234017]
We introduce MetaReflection, a novel offline reinforcement learning technique that enhances the performance of Language Agents.
We demonstrate the efficacy of MetaReflection by evaluating across multiple domains, including complex logical reasoning, biomedical semantic similarity, open world question answering, and vulnerability threat detection.
arXiv Detail & Related papers (2024-05-13T10:51:43Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - A Trembling House of Cards? Mapping Adversarial Attacks against Language
Agents [37.978142062138986]
We present the first systematic effort in mapping adversarial attacks against language agents.
We propose 12 potential attack scenarios against different components of an agent, covering different attack strategies.
We emphasize the urgency to gain a thorough understanding of language agent risks before their widespread deployment.
arXiv Detail & Related papers (2024-02-15T18:51:32Z) - Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
Key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL)
This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z) - A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions
with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z) - MOCA: A Modular Object-Centric Approach for Interactive Instruction
Following [19.57344182656879]
We propose a modular architecture that decouples the task into visual perception and action policy.
We evaluate our method on the ALFRED benchmark and empirically validate that it outperforms prior arts.
arXiv Detail & Related papers (2020-12-06T07:59:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.