Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
- URL: http://arxiv.org/abs/2408.08926v3
- Date: Thu, 05 Dec 2024 19:46:36 GMT
- Title: Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
- Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W. Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang,
- Abstract summary: Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks.<n>We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions.<n>We construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
- Score: 33.1538965735133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description, starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks for each task, which break down a task into intermediary steps for a more detailed evaluation. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top performing models (GPT-4o and Claude 3.5 Sonnet), we further investigate performance across 4 agent scaffolds (structed bash, action-only, pseudoterminal, and web search). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that took human teams up to 11 minutes to solve. In comparison, the most difficult task took human teams 24 hours and 54 minutes to solve. All code and data are publicly available at https://cybench.github.io.
Related papers
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive.
We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons.
The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - D-CIPHER: Dynamic Collaborative Intelligent Agents with Planning and Heterogeneous Execution for Enhanced Reasoning in Offensive Security [22.86304661035188]
Large Language Models (LLMs) have been used in cybersecurity in many ways.
Capture the Flag (CTF) challenges serve as benchmarks for assessing the automated task-planning abilities of LLM agents.
We introduce the D-CIPHER multi-agent LLM framework for collaborative CTF challenge solving.
arXiv Detail & Related papers (2025-02-15T23:43:18Z) - TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We build a self-contained environment with data that mimics a small software company environment.
We find that with the most competitive agent, 24% of the tasks can be completed autonomously.
This paints a nuanced picture on task automation with LM agents.
arXiv Detail & Related papers (2024-12-18T18:55:40Z) - Two Heads Are Better Than One: Collaborative LLM Embodied Agents for Human-Robot Interaction [1.6574413179773757]
Large language models (LLMs) should be able to leverage their large breadth of understanding to interpret natural language commands.
However, these models suffer from hallucinations, which may cause safety issues or deviations from the task.
In this research, multiple collaborative AI systems were tested against a single independent AI agent to determine whether the success in other domains would translate into improved human-robot interaction performance.
arXiv Detail & Related papers (2024-11-23T02:47:12Z) - Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI)
Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z) - On the Resilience of Multi-Agent Systems with Malicious Agents [58.79302663733702]
This paper investigates what is the resilience of multi-agent system structures under malicious agents.
We devise two methods, AutoTransform and AutoInject, to transform any agent into a malicious one.
We show that two defense methods, introducing a mechanism for each agent to challenge others' outputs, or an additional agent to review and correct messages, can enhance system resilience.
arXiv Detail & Related papers (2024-08-02T03:25:20Z) - MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z) - CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z) - Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning [67.26776442697184]
We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space.
Husky iterates between two stages: 1) generating the next action to take towards solving a given task and 2) executing the action using expert models.
Our experiments show that Husky outperforms prior language agents across 14 evaluation datasets.
arXiv Detail & Related papers (2024-06-10T17:07:25Z) - A Preliminary Study on Using Large Language Models in Software
Pentesting [2.0551676463612636]
Large language models (LLM) are perceived to offer promising potentials for automating security tasks.
We investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code.
arXiv Detail & Related papers (2024-01-30T21:42:59Z) - MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z) - AgentSims: An Open-Source Sandbox for Large Language Model Evaluation [9.156652770482268]
Existing evaluation methods suffer from following shortcomings: (1) constrained evaluation abilities, (2) vulnerable benchmarks, (3) unobjective metrics.
We suggest that task-based evaluation, where LLM agents complete tasks in a simulated environment, is a one-for-all solution to solve above problems.
AgentSims is an easy-to-use infrastructure for researchers from all disciplines to test the specific capacities they are interested in.
arXiv Detail & Related papers (2023-08-08T03:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.