AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
- URL: http://arxiv.org/abs/2506.14682v1
- Date: Tue, 17 Jun 2025 16:19:06 GMT
- Title: AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
- Authors: Ads Dawson, Rob Mulla, Nick Landers, Shane Caldwell,
- Abstract summary: We introduce AIRTBench, an AI red teaming benchmark for evaluating language models' ability to autonomously discover and exploit AI/ML security vulnerabilities.<n>The benchmark consists of 70 realistic black-box capture-the-flag (CTF) challenges from the Crucible challenge environment on the Dreadnode platform.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce AIRTBench, an AI red teaming benchmark for evaluating language models' ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities. The benchmark consists of 70 realistic black-box capture-the-flag (CTF) challenges from the Crucible challenge environment on the Dreadnode platform, requiring models to write python code to interact with and compromise AI systems. Claude-3.7-Sonnet emerged as the clear leader, solving 43 challenges (61% of the total suite, 46.9% overall success rate), with Gemini-2.5-Pro following at 39 challenges (56%, 34.3% overall), GPT-4.5-Preview at 34 challenges (49%, 36.9% overall), and DeepSeek R1 at 29 challenges (41%, 26.9% overall). Our evaluations show frontier models excel at prompt injection attacks (averaging 49% success rates) but struggle with system exploitation and model inversion challenges (below 26%, even for the best performers). Frontier models are far outpacing open-source alternatives, with the best truly open-source model (Llama-4-17B) solving 7 challenges (10%, 1.0% overall), though demonstrating specialized capabilities on certain hard challenges. Compared to human security researchers, large language models (LLMs) solve challenges with remarkable efficiency completing in minutes what typically takes humans hours or days-with efficiency advantages of over 5,000x on hard challenges. Our contribution fills a critical gap in the evaluation landscape, providing the first comprehensive benchmark specifically designed to measure and track progress in autonomous AI red teaming capabilities.
Related papers
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.<n>We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.<n>Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios.<n>We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models.<n>Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z) - The Automation Advantage in AI Red Teaming [0.0]
This paper analyzes Large Language Model (LLM) security vulnerabilities based on data from Crucible.<n>Our findings reveal automated approaches significantly outperform manual techniques, despite only 5.2% of users employing automation.
arXiv Detail & Related papers (2025-04-28T14:48:00Z) - AI Cyber Risk Benchmark: Automated Exploitation Capabilities [0.24578723416255752]
We introduce a new benchmark for assessing AI models' capabilities and risks in automated software exploitation.<n>We evaluate several leading language models, including OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620.
arXiv Detail & Related papers (2024-10-29T10:57:11Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
Super aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - Uncovering Weaknesses in Neural Code Generation [21.552898575210534]
We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of weaknesses.
In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases.
Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks.
All models struggle with proper API usage, a challenge amplified by vague or complex prompts.
arXiv Detail & Related papers (2024-07-13T07:31:43Z) - SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages [64.10040374077994]
We introduce SEACrowd, a collaborative initiative that consolidates standardized corpora in nearly 1,000 languages across three modalities.<n>We assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA.
arXiv Detail & Related papers (2024-06-14T15:23:39Z) - ABAW : Facial Expression Recognition in the wild [3.823356975862006]
We have dealt only expression classification challenge using multiple approaches such as fully supervised, semi-supervised and noisy label approach.
Our approach using noise aware model has performed better than baseline model by 10.46%.
arXiv Detail & Related papers (2023-03-17T06:01:04Z) - How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language
Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z) - CommonsenseQA 2.0: Exposing the Limits of AI through Gamification [126.85096257968414]
We construct benchmarks that test the abilities of modern natural language understanding models.
In this work, we propose gamification as a framework for data construction.
arXiv Detail & Related papers (2022-01-14T06:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.