AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
- URL: http://arxiv.org/abs/2503.01811v1
- Date: Mon, 03 Mar 2025 18:39:48 GMT
- Title: AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
- Authors: Nicholas Carlini, Javier Rando, Edoardo Debenedetti, Milad Nasr, Florian Tramèr
- Abstract summary: AutoAdvExBench is a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. We design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. We show that this agent succeeds on only 13% of the real-world defenses in our benchmark, indicating the large gap between the difficulty of attacking "real" code and CTF-like code.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent succeeds on only 13% of the real-world defenses in our benchmark, indicating the large gap between the difficulty of attacking "real" code and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses only succeeds on 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
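The kind of attack an agent must reproduce against these defenses can be illustrated with a minimal projected gradient descent (PGD) loop. This is a generic white-box attack sketch, not the benchmark's actual agent or models: the logistic-regression "classifier" and the epsilon/step-size values below are toy assumptions chosen so the example runs stand-alone.

```python
import numpy as np

# Toy differentiable "classifier": logistic regression with fixed random weights.
# In the benchmark, the agent must instead attack real defended vision models.
rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.1

def predict(x):
    """P(class 1) under the toy model."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def pgd_attack(x, y, eps=0.5, alpha=0.1, steps=20):
    """L-inf PGD: take signed gradient steps on the loss, projecting into the eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        p = predict(x_adv)
        grad = (p - y) * w                        # d(cross-entropy)/dx for logistic regression
        x_adv = x_adv + alpha * np.sign(grad)     # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within the perturbation budget
    return x_adv

x = rng.normal(size=8)
y = 1.0 if predict(x) > 0.5 else 0.0  # the model's own clean prediction
x_adv = pgd_attack(x, y)
print(predict(x), predict(x_adv))  # adversarial confidence moves away from y
```

The benchmark's difficulty lies not in this loop itself but in wiring it up against unfamiliar, real defense codebases, which is where the agents' success rate drops from 75% to 13%.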
Related papers
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling agentic reasoning systems at test time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models [55.93380086403591]
Generative large language models are vulnerable to backdoor attacks.
ELBA-Bench allows attackers to inject backdoors through parameter-efficient fine-tuning.
ELBA-Bench provides over 1300 experiments.
arXiv Detail & Related papers (2025-02-22T12:55:28Z) - The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks.
We then make a key observation: existing defense mechanisms suffer from an over-prudence problem.
We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
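"Chance agreement" between two jailbreak judges can be made concrete with Cohen's kappa, which corrects raw agreement for what two independent judges would match on by luck. A minimal sketch; the two verdict lists are hypothetical, not data from the paper:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary judges: (observed - expected) / (1 - expected)."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each judge's marginal rate of saying "jailbroken" (1)
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from, say, a keyword-matching judge and an LLM judge
judge_a = [1, 1, 0, 0, 1, 0, 1, 0]
judge_b = [1, 0, 0, 1, 1, 0, 0, 1]
print(round(cohens_kappa(judge_a, judge_b), 3))  # → 0.0 (agreement no better than chance)
```

A kappa near zero, as here, is exactly the failure mode the observation describes: the judges agree 50% of the time, but that is what chance alone predicts.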
arXiv Detail & Related papers (2024-11-13T07:57:19Z) - The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks [2.6528263069045126]
Large language models (LLMs) could soon become integral to autonomous cyber agents.
We introduce novel defense strategies that exploit the inherent vulnerabilities of attacking LLMs.
Our results show defense success rates of up to 90%, demonstrating the effectiveness of turning LLM vulnerabilities into defensive strategies.
arXiv Detail & Related papers (2024-10-20T14:07:24Z) - Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents [32.62654499260479]
Agent Security Bench (ASB) is a framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents.
We benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, a mixed attack, and 10 corresponding defenses across 13 LLM backbones.
Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval.
arXiv Detail & Related papers (2024-10-03T16:30:47Z) - SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend. We empirically validate it using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models.
arXiv Detail & Related papers (2024-06-08T15:45:31Z) - AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks [20.5016054418053]
AutoDefense is a multi-agent defense framework that filters harmful responses from large language models.
Our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models.
arXiv Detail & Related papers (2024-03-02T16:52:22Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text optimizers can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high cost of optimization, makes standard adaptive attacks more challenging for LLMs.
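The perplexity-based detection mentioned above can be sketched with a toy language model: optimizer-found adversarial suffixes tend to be high-perplexity gibberish, so a threshold on prompt perplexity filters many of them. The unigram model, corpus, and threshold here are stand-ins for a real LLM's token likelihoods, chosen only so the example is self-contained:

```python
import math
from collections import Counter

def perplexity(tokens, lm_counts, total):
    """Perplexity under a unigram LM with add-one smoothing (a stand-in for a real LLM)."""
    vocab = len(lm_counts)
    log_p = sum(math.log((lm_counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_p / len(tokens))

# "Train" the toy LM on a scrap of ordinary text
corpus = "please summarize this article for me in a few sentences".split()
counts = Counter(corpus)
total = sum(counts.values())

def is_suspicious(prompt, threshold=15.0):
    """Flag prompts whose perplexity exceeds the threshold."""
    return perplexity(prompt.split(), counts, total) > threshold

print(is_suspicious("please summarize this article"))    # → False (in-distribution text)
print(is_suspicious("zx! describing.-- ;) similarlyNow"))  # → True (gibberish-like suffix)
```

The paper's point survives even in this toy: the filter works precisely because current discrete text optimizers produce unnatural strings, so an attacker constrained to low-perplexity text faces a much harder optimization problem.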
arXiv Detail & Related papers (2023-09-01T17:59:44Z) - A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.