Related papers: Ctrl-Z: Controlling AI Agents via Resampling

Ctrl-Z: Controlling AI Agents via Resampling

URL: http://arxiv.org/abs/2504.10374v1
Date: Mon, 14 Apr 2025 16:22:11 GMT
Title: Ctrl-Z: Controlling AI Agents via Resampling
Authors: Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris,
Abstract summary: We construct BashBench, a dataset of 257 challenging multi-step system administration tasks.<n>We evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code.
Score: 2.505902477669827
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.

Related papers

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence [4.027240141373031]
We propose a framework for adapting affordances of red teams to advancing AI capabilities.<n>We show how knowledge of an agents's actual capability profile can inform proportional control evaluations.
arXiv Detail & Related papers (2025-04-07T16:52:52Z)
Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems [1.2564343689544843]
We develop simulations of AI agents collaborating on shared objectives to study security risks and trade-offs.<n>We observe infectious malicious prompts - the multi-hop spreading of malicious instructions.<n>Our findings illustrate potential trade-off between security and collaborative efficiency in multi-agent systems.
arXiv Detail & Related papers (2025-02-26T14:00:35Z)
MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks.<n>We present MELON, a novel IPI defense.<n>We show that MELON outperforms SOTA defenses in both attack prevention and utility preservation.
arXiv Detail & Related papers (2025-02-07T18:57:49Z)
CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception [53.088988929450494]
Collaborative perception (CP) is a promising method for safe connected and autonomous driving.<n>We propose a new paradigm for malicious agent detection that effectively identifies malicious agents at the feature level.<n>We also develop a robust defense method called CP-Guard+, which enhances the margin between the representations of benign and malicious features.
arXiv Detail & Related papers (2025-02-07T12:58:45Z)
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In [5.65782619470663]
We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack. Our experiments show that indirect prompt injection attacks can significantly increase the likelihood of the agent performing subsequent malicious actions. To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution.
arXiv Detail & Related papers (2024-10-22T12:24:41Z)
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols [52.40622903199512]
This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable game. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants.
arXiv Detail & Related papers (2024-09-12T12:30:07Z)
SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources. Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker. Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception. We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception. We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
arXiv Detail & Related papers (2023-10-18T11:36:42Z)
Balancing detectability and performance of attacks on the control channel of Markov Decision Processes [77.66954176188426]
We investigate the problem of designing optimal stealthy poisoning attacks on the control channel of Markov decision processes (MDPs) This research is motivated by the recent interest of the research community for adversarial and poisoning attacks applied to MDPs, and reinforcement learning (RL) methods.
arXiv Detail & Related papers (2021-09-15T09:13:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.