AI Cyber Risk Benchmark: Automated Exploitation Capabilities
- URL: http://arxiv.org/abs/2410.21939v2
- Date: Mon, 09 Dec 2024 15:29:55 GMT
- Title: AI Cyber Risk Benchmark: Automated Exploitation Capabilities
- Authors: Dan Ristea, Vasilios Mavroudis, Chris Hicks
- Abstract summary: We introduce a new benchmark for assessing AI models' capabilities and risks in automated software exploitation. We evaluate several leading language models, including OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620.
- Score: 0.24578723416255752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new benchmark for assessing AI models' capabilities and risks in automated software exploitation, focusing on their ability to detect and exploit vulnerabilities in real-world software systems. Using DARPA's AI Cyber Challenge (AIxCC) framework and the Nginx challenge project, a deliberately modified version of the widely used Nginx web server, we evaluate several leading language models, including OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620, Google DeepMind's Gemini-1.5-pro, and OpenAI's earlier GPT-4o model. Our findings reveal that these models vary significantly in their success rates and efficiency, with o1-preview achieving the highest success rate of 64.71 percent and o1-mini and Claude-3.5-sonnet-20241022 providing cost-effective but less successful alternatives. This benchmark establishes a foundation for systematically evaluating the AI cyber risk posed by automated exploitation tools.
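As a rough illustration of how such a benchmark arrives at its headline numbers, the scoring reduces to counting successful exploit attempts per model and normalizing by cost. The sketch below is a hypothetical harness, not the authors' AIxCC tooling; the model names, task names, and cost figures are invented for the example.
```python
# Hypothetical scoring sketch for an exploitation benchmark.
# Each record notes whether a model's attempt on a challenge task
# produced a working proof-of-vulnerability and what the run cost.
from collections import defaultdict

# Illustrative results only; not data from the paper.
attempts = [
    {"model": "model-a", "task": "nginx-cpv-1", "success": True,  "cost_usd": 0.42},
    {"model": "model-a", "task": "nginx-cpv-2", "success": False, "cost_usd": 0.51},
    {"model": "model-b", "task": "nginx-cpv-1", "success": True,  "cost_usd": 0.08},
    {"model": "model-b", "task": "nginx-cpv-2", "success": True,  "cost_usd": 0.09},
]

stats = defaultdict(lambda: {"wins": 0, "runs": 0, "cost": 0.0})
for a in attempts:
    s = stats[a["model"]]
    s["runs"] += 1
    s["wins"] += int(a["success"])
    s["cost"] += a["cost_usd"]

for model, s in stats.items():
    rate = 100.0 * s["wins"] / s["runs"]
    per_win = s["cost"] / s["wins"] if s["wins"] else float("inf")
    print(f"{model}: success {rate:.2f}%  cost per success ${per_win:.2f}")
```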
Related papers
- CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale [46.76144797837242]
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. We introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities.
arXiv Detail & Related papers (2025-06-03T07:35:14Z)
- CAI: An Open, Bug Bounty-Ready Cybersecurity AI [0.3889280708089931]
Cybersecurity AI (CAI) is an open-source framework that democratizes advanced security testing through specialized AI agents.
We demonstrate that CAI consistently outperforms state-of-the-art results in CTF benchmarks.
CAI reached top-30 in Spain and top-500 worldwide on Hack The Box within a week.
arXiv Detail & Related papers (2025-04-08T13:22:09Z)
- How Well Can AI Build SD Models? [0.0]
We introduce two metrics for evaluating AI-generated causal maps: technical correctness (causal translation) and adherence to instructions (conformance).
We tested 11 different LLMs on their ability to perform causal translation as well as conform to user instructions.
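One way to picture the "causal translation" check is as an overlap score between the cause-effect links a model extracts and a reference causal map. The edge-level F1 below is an illustrative interpretation, not the paper's exact metric definition; the example edges are invented.
```python
# Illustrative check of causal-translation correctness: compare the
# cause->effect links a model extracts from text against a reference map.
# The edge-overlap F1 is an assumption for illustration only.

def edge_f1(predicted: set, reference: set) -> float:
    """F1 over directed causal edges represented as (cause, effect) tuples."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

reference = {("births", "population"), ("population", "deaths"), ("deaths", "population")}
predicted = {("births", "population"), ("population", "deaths")}
print(f"causal-translation F1: {edge_f1(predicted, reference):.2f}")
```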
arXiv Detail & Related papers (2025-03-19T14:48:47Z)
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems.
We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure.
Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- WebGames: Challenging General-Purpose Web-Browsing AI Agents [11.320069795732058]
WebGames is a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents.
We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance.
Results reveal a substantial capability gap, with the best AI system achieving only a 43.1% success rate, compared to human performance of 95.7%.
arXiv Detail & Related papers (2025-02-25T16:45:08Z)
- o3-mini vs DeepSeek-R1: Which One is Safer? [6.105030666773317]
DeepSeek-R1 constitutes a turning point for the AI industry.
OpenAI's o3-mini model is expected to set high standards in terms of performance, safety and cost.
We make use of our recently released automated safety testing tool, named ASTRAL.
arXiv Detail & Related papers (2025-01-30T15:45:56Z)
- Fundamental Risks in the Current Deployment of General-Purpose AI Models: What Have We (Not) Learnt From Cybersecurity? [60.629883024152576]
Large Language Models (LLMs) have seen rapid deployment in a wide range of use cases.
Agentic systems such as OpenAI's Altera are just a few examples of increased autonomy, data access, and execution capabilities.
These methods come with a range of cybersecurity challenges.
arXiv Detail & Related papers (2024-12-19T14:44:41Z)
- ChatNVD: Advancing Cybersecurity Vulnerability Assessment with Large Language Models [0.46873264197900916]
ChatNVD is a support tool powered by Large Language Models (LLMs) to generate accessible, context-rich summaries of software vulnerabilities. We develop three variants of ChatNVD, utilizing three prominent LLMs: GPT-4o Mini by OpenAI, LLaMA 3 by Meta, and Gemini 1.5 Pro by Google. Our results demonstrate that GPT-4o Mini outperforms the other models, achieving over 92% accuracy and the lowest error rates.
arXiv Detail & Related papers (2024-12-06T03:45:49Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements [1.4433703131122861]
Large language models (LLMs) have shown potential across various domains, including cybersecurity.
There is currently no comprehensive, open, end-to-end automated penetration testing benchmark.
This paper introduces a novel open benchmark for LLM-based automated penetration testing.
arXiv Detail & Related papers (2024-10-22T16:18:41Z)
- A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
- Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design [63.24275274981911]
Compound AI Systems consisting of many language model inference calls are increasingly employed.
In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness.
We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems.
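A minimal sketch of the "best-of-K" / judge-based pattern described above: K independent generators propose answers and a judge picks the highest-scoring one. The generator and judge here are toy stand-ins, not the language model calls used in the paper.
```python
# Sketch of a verifier-based judge Network of Networks (NoN):
# K generators propose answers, a judge scores them, and the
# highest-scoring proposal is returned.
import random
from typing import Callable

def best_of_k(question: str,
              generate: Callable[[str], str],
              judge: Callable[[str, str], float],
              k: int = 5) -> str:
    proposals = [generate(question) for _ in range(k)]
    return max(proposals, key=lambda answer: judge(question, answer))

# Toy stand-ins: a noisy generator and a judge that prefers even digits.
toy_generate = lambda q: str(random.randint(0, 9))
toy_judge = lambda q, a: 1.0 if int(a) % 2 == 0 else 0.0
print(best_of_k("pick an even digit", toy_generate, toy_judge, k=5))
```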
arXiv Detail & Related papers (2024-07-23T20:40:37Z)
- Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning [16.54022485688803]
VulLLM is a novel framework that integrates multi-task learning with Large Language Models (LLMs) to effectively mine deep-seated vulnerability features.
The experiments conducted on six large datasets demonstrate that VulLLM surpasses seven state-of-the-art models in terms of effectiveness, generalization, and robustness.
arXiv Detail & Related papers (2024-06-06T03:29:05Z)
- Introducing v0.5 of the AI Safety Benchmark from MLCommons [101.98401637778638]
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group.
The benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models.
arXiv Detail & Related papers (2024-04-18T15:01:00Z)
- Securing Graph Neural Networks in MLaaS: A Comprehensive Realization of Query-based Integrity Verification [68.86863899919358]
We introduce a groundbreaking approach to protect GNN models in Machine Learning as a Service (MLaaS) from model-centric attacks.
Our approach includes a comprehensive verification schema for GNN integrity, taking into account both transductive and inductive GNNs.
We propose a query-based verification technique, fortified with innovative node fingerprint generation algorithms.
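The verification idea can be pictured as a challenge-response check: record the served model's outputs on a set of fingerprint inputs at deployment time, then re-query and compare digests. The sketch below is a generic illustration with a linear stand-in model and random fingerprints, not the paper's node-fingerprint generation algorithm.
```python
# Generic query-based integrity check: hash the model's responses on
# fingerprint inputs and compare against a stored reference digest.
# Fingerprint selection here is random; the paper proposes dedicated
# node-fingerprint generation algorithms instead.
import hashlib
import numpy as np

def fingerprint_digest(model_fn, fingerprint_inputs: np.ndarray) -> str:
    preds = model_fn(fingerprint_inputs)
    return hashlib.sha256(np.round(preds, 6).tobytes()).hexdigest()

rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 4))
model = lambda x: x @ weights             # stand-in for a served GNN
fingerprints = rng.normal(size=(8, 16))   # stand-in fingerprint node features

reference = fingerprint_digest(model, fingerprints)          # stored at deployment
assert fingerprint_digest(model, fingerprints) == reference  # intact model verifies

tampered = lambda x: x @ (weights + 0.01)                     # modified model
print("tampering detected:", fingerprint_digest(tampered, fingerprints) != reference)
```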
arXiv Detail & Related papers (2023-12-13T03:17:05Z)
- HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs) [0.09208007322096533]
We present HuntGPT, a specialized intrusion detection dashboard applying a Random Forest classifier.
The paper delves into the system's architecture, components, and technical accuracy, assessed through Certified Information Security Manager (CISM) Practice Exams.
The results demonstrate that conversational agents, supported by LLM and integrated with XAI, provide robust, explainable, and actionable AI solutions in intrusion detection.
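The detection core is a conventional supervised classifier; a minimal stand-in with scikit-learn (synthetic "flow" features, not HuntGPT's actual data or pipeline) looks like this:
```python
# Minimal intrusion classifier in the spirit of HuntGPT's Random Forest
# core, trained on synthetic features. The real system pairs such a
# classifier with explainable-AI components and an LLM front end.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)  # imbalanced: mostly benign traffic
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["benign", "attack"]))
```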
arXiv Detail & Related papers (2023-09-27T20:58:13Z)
- Getting pwn'd by AI: Penetration Testing with Large Language Models [0.0]
This paper explores the potential usage of large language models, such as GPT-3.5, to augment penetration testers with AI sparring partners.
We explore the feasibility of supplementing penetration testers with AI models for two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine.
arXiv Detail & Related papers (2023-07-24T19:59:22Z)
- A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z)
- When Authentication Is Not Enough: On the Security of Behavioral-Based Driver Authentication Systems [53.2306792009435]
We develop two lightweight driver authentication systems based on Random Forest and Recurrent Neural Network architectures.
We are the first to propose attacks against these systems by developing two novel evasion attacks, SMARTCAN and GANCAN.
Through our contributions, we aid practitioners in safely adopting these systems, help reduce car thefts, and enhance driver security.
arXiv Detail & Related papers (2023-06-09T14:33:26Z)
- OutCenTR: A novel semi-supervised framework for predicting exploits of vulnerabilities in high-dimensional datasets [0.0]
We make use of outlier detection techniques to predict vulnerabilities that are likely to be exploited.
We propose a dimensionality reduction technique, OutCenTR, that enhances the baseline outlier detection models.
The results of our experiments show on average a 5-fold improvement of F1 score in comparison with state-of-the-art dimensionality reduction techniques.
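As a rough analogue of the pipeline (dimensionality reduction followed by outlier detection), the sketch below uses PCA and IsolationForest as generic stand-ins on synthetic data; OutCenTR itself is the paper's own reduction technique and is not reproduced here.
```python
# Generic "reduce, then detect outliers" pipeline for flagging
# vulnerabilities likely to be exploited. PCA and IsolationForest are
# stand-ins; OutCenTR's reduction is not reproduced here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
bulk = rng.normal(0, 1, size=(500, 50))   # typical (unexploited) vulnerabilities
rare = rng.normal(4, 1, size=(15, 50))    # atypical, exploit-like vulnerabilities
X = np.vstack([bulk, rare])

pipeline = make_pipeline(PCA(n_components=5),
                         IsolationForest(contamination=0.03, random_state=0))
flags = pipeline.fit_predict(X)           # -1 marks predicted outliers
print("flagged as likely to be exploited:", int((flags == -1).sum()))
```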
arXiv Detail & Related papers (2023-04-03T00:34:41Z)
- Publishing Efficient On-device Models Increases Adversarial Vulnerability [58.6975494957865]
In this paper, we study the security considerations of publishing on-device variants of large-scale models.
We first show that an adversary can exploit on-device models to make attacking the large models easier.
We then show that the vulnerability increases as the similarity between a full-scale model and its efficient variant increases.
arXiv Detail & Related papers (2022-12-28T05:05:58Z)
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
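The ensembling step can be pictured as fusing per-statement scores from the two model families and ranking statements by the combined score. The toy sketch below uses invented scores and a simple average; it does not reproduce VELVET's networks or fusion weights.
```python
# Toy illustration of ensembling: fuse per-statement vulnerability scores
# from a graph-based model and a sequence-based model, then report the
# top-1 statement. Scores are invented for the example.
import numpy as np

graph_scores = np.array([0.10, 0.65, 0.20, 0.05])  # one score per statement
seq_scores   = np.array([0.15, 0.40, 0.35, 0.10])

fused = 0.5 * graph_scores + 0.5 * seq_scores       # simple average fusion
top1 = int(np.argmax(fused))
print(f"predicted vulnerable statement index: {top1}, score {fused[top1]:.2f}")
```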
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
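The overstability probe amounts to injecting off-topic content into an essay and checking whether the score moves. The schematic version below uses a trivial length-based stand-in scorer, not any of the evaluated AES models.
```python
# Schematic overstability check: replace ~25% of an essay's words with
# off-topic filler and compare scores. The scorer is a trivial stand-in.
import random

def toy_score(essay: str) -> float:
    return min(10.0, len(essay.split()) / 30.0)  # length-only stand-in scorer

def perturb(essay: str, fraction: float = 0.25, seed: int = 0) -> str:
    words = essay.split()
    rng = random.Random(seed)
    for i in rng.sample(range(len(words)), k=int(len(words) * fraction)):
        words[i] = "pineapple"                   # content unrelated to the topic
    return " ".join(words)

essay = ("the water cycle describes how water evaporates condenses "
         "and returns as rain ") * 10
original, modified = toy_score(essay), toy_score(perturb(essay))
print(f"score change after 25% off-topic edits: {original - modified:.2f}")
```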
arXiv Detail & Related papers (2020-07-14T03:49:43Z)