Related papers: Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks

Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks

URL: http://arxiv.org/abs/2410.20911v2
Date: Mon, 18 Nov 2024 09:15:46 GMT
Title: Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
Authors: Dario Pasquini, Evgenios M. Kornaropoulos, Giuseppe Ateniese,
Abstract summary: Large language models (LLMs) are increasingly being harnessed to automate cyberattacks. Mantis is a framework that exploits LLMs' susceptibility to adversarial inputs to undermine malicious operations. Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks.
Score: 15.726286532500971
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly being harnessed to automate cyberattacks, making sophisticated exploits more accessible and scalable. In response, we propose a new defense strategy tailored to counter LLM-driven cyberattacks. We introduce Mantis, a defensive framework that exploits LLMs' susceptibility to adversarial inputs to undermine malicious operations. Upon detecting an automated cyberattack, Mantis plants carefully crafted inputs into system responses, leading the attacker's LLM to disrupt their own operations (passive defense) or even compromise the attacker's machine (active defense). By deploying purposefully vulnerable decoy services to attract the attacker and using dynamic prompt injections for the attacker's LLM, Mantis can autonomously hack back the attacker. In our experiments, Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks. To foster further research and collaboration, Mantis is available as an open-source tool: https://github.com/pasquini-dario/project_mantis

Related papers

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [55.28518567702213]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities.<n>This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats.<n>We propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction.
arXiv Detail & Related papers (2025-06-09T06:35:12Z)
LLMs unlock new paths to monetizing exploits [85.60539289753564]
Large language models (LLMs) will soon alter the economics of cyberattacks.<n>LLMs enable adversaries to launch tailored attacks on a user-by-user basis.
arXiv Detail & Related papers (2025-05-16T17:05:25Z)
MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions.<n>We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z)
Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks. As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks [2.6528263069045126]
Large language models (LLMs) could soon become integral to autonomous cyber agents. We introduce novel defense strategies that exploit the inherent vulnerabilities of attacking LLMs. Our results show defense success rates of up to 90%, demonstrating the effectiveness of turning LLM vulnerabilities into defensive strategies.
arXiv Detail & Related papers (2024-10-20T14:07:24Z)
Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models. In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned. We propose a novel defense approach called PDB (Proactive Defensive Backdoor)
arXiv Detail & Related papers (2024-05-25T07:52:26Z)
Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game [28.33029508522531]
Malicious attackers induce large models to jailbreak and generate information containing illegal, privacy-invasive information. Large models counter malicious attackers' attacks using techniques such as safety alignment. We propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both safely reply to the attacker and hide the defense intent.
arXiv Detail & Related papers (2024-04-03T07:43:11Z)
AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks [13.955084410934694]
Large language models (LLMs) have demonstrated impressive results on natural language tasks. As LLMs inevitably advance, they may be able to automate both the pre- and post-breach attack stages. This research can help defensive systems and teams learn to detect novel attack behaviors preemptively before their use in the wild.
arXiv Detail & Related papers (2024-03-02T00:10:45Z)
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses. We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks. backdoor attack is an emerging yet threatening training-phase threat. We propose a sparse and invisible backdoor attack (SIBA)
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
Fixed Points in Cyber Space: Rethinking Optimal Evasion Attacks in the Age of AI-NIDS [70.60975663021952]
We study blackbox adversarial attacks on network classifiers. We argue that attacker-defender fixed points are themselves general-sum games with complex phase transitions. We show that a continual learning approach is required to study attacker-defender dynamics.
arXiv Detail & Related papers (2021-11-23T23:42:16Z)
Arms Race in Adversarial Malware Detection: A Survey [33.8941961394801]
Malicious software (malware) is a major cyber threat that has to be tackled with Machine Learning (ML) techniques. ML is vulnerable to attacks known as adversarial examples. Knowing the defender's feature set is critical to the success of transfer attacks. The effectiveness of adversarial training depends on the defender's capability in identifying the most powerful attack.
arXiv Detail & Related papers (2020-05-24T07:20:42Z)
On Certifying Robustness against Backdoor Attacks via Randomized Smoothing [74.79764677396773]
We study the feasibility and effectiveness of certifying robustness against backdoor attacks using a recent technique called randomized smoothing. Our results show the theoretical feasibility of using randomized smoothing to certify robustness against backdoor attacks. Existing randomized smoothing methods have limited effectiveness at defending against backdoor attacks.
arXiv Detail & Related papers (2020-02-26T19:15:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.