Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections
- URL: http://arxiv.org/abs/2510.26328v1
- Date: Thu, 30 Oct 2025 10:27:11 GMT
- Title: Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections
- Authors: David Schmotz, Sahar Abdelnabi, Maksym Andriushchenko,
- Abstract summary: A frontier LLM company made a step towards this by introducing Agent Skills.<n>We show that they are fundamentally insecure, since they enable trivially simple prompt injections.<n>We demonstrate how to hide malicious instructions in long Agent Skill files and referenced scripts to exfiltrate sensitive data.
- Score: 24.46526203453932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling continual learning in LLMs remains a key unresolved research challenge. In a recent announcement, a frontier LLM company made a step towards this by introducing Agent Skills, a framework that equips agents with new knowledge based on instructions stored in simple markdown files. Although Agent Skills can be a very useful tool, we show that they are fundamentally insecure, since they enable trivially simple prompt injections. We demonstrate how to hide malicious instructions in long Agent Skill files and referenced scripts to exfiltrate sensitive data, such as internal files or passwords. Importantly, we show how to bypass system-level guardrails of a popular coding agent: a benign, task-specific approval with the "Don't ask again" option can carry over to closely related but harmful actions. Overall, we conclude that despite ongoing research efforts and scaling model capabilities, frontier LLMs remain vulnerable to very simple prompt injections in realistic scenarios. Our code is available at https://github.com/aisa-group/promptinject-agent-skills.
Related papers
- Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks [27.120130204872325]
We introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files.<n>SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions.<n>Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models.
arXiv Detail & Related papers (2026-02-23T18:59:27Z) - SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement [120.52289344734415]
We propose an automated framework for stealthy prompt injection tailored to agent skills.<n>The framework forms a closed loop with three agents: an Attack Agent that synthesizes injection skills under explicit stealth constraints, a Code Agent that executes tasks using the injected skills and an Evaluate Agent that logs action traces.<n>Our method consistently achieves high attack success rates under realistic settings.
arXiv Detail & Related papers (2026-02-15T16:09:48Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - QueryIPI: Query-agnostic Indirect Prompt Injection on Coding Agents [13.098854359317523]
We present QueryIPI, the first query-agnostic IPI method for coding agents.<n>It refines malicious tool descriptions through an iterative, prompt-based process informed by the leaked internal prompt.<n>Experiments on five simulated agents show that QueryIPI achieves up to 87 percent success.
arXiv Detail & Related papers (2025-10-27T07:04:08Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - Defeating Prompt Injections by Design [79.00910871948787]
CaMeL is a robust defense that creates a protective system layer around the Large Language Models.<n>To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query.<n>To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows.
arXiv Detail & Related papers (2025-03-24T15:54:10Z) - Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach [9.483655213280738]
This paper presents a novel approach to evaluating the security of large language models (LLMs)<n>We define prompt leakage as a critical threat to secure LLM deployment.<n>We implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt.
arXiv Detail & Related papers (2025-02-18T08:17:32Z) - LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage [78.33839735526769]
LeakAgent is a novel black-box red-teaming framework for privacy leakage.<n>Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts.<n>We show that LeakAgent significantly outperforms existing rule-based approaches in training data extraction and automated methods in system prompt leakage.
arXiv Detail & Related papers (2024-12-07T20:09:01Z) - AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.<n>We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.<n>We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z) - BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents [26.057916556444333]
We show that such methods are vulnerable to our proposed backdoor attacks named BadAgent.
Our proposed attack methods are extremely robust even after fine-tuning on trustworthy data.
arXiv Detail & Related papers (2024-06-05T07:14:28Z) - Teams of LLM Agents can Exploit Zero-Day Vulnerabilities [3.494084149854375]
We show that teams of LLM agents can exploit real-world, zero-day vulnerabilities.<n>We introduce HPTSA, a system of agents with a planning agent that can launch subagents.<n>We construct a benchmark of 14 real-world vulnerabilities and show that our team of agents improve over prior agent frameworks by up to 4.3X.
arXiv Detail & Related papers (2024-06-02T16:25:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.