Related papers: Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot

Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot

URL: http://arxiv.org/abs/2501.06963v1
Date: Sun, 12 Jan 2025 22:48:37 GMT
Title: Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot
Authors: Antonio López Martínez, Alejandro Cano, Antonio Ruiz-Martínez,
Abstract summary: GenAI can be applied across numerous fields, with particular relevance in cybersecurity.<n>In this paper, we have analyzed the potential of leading generic-purpose GenAI tools.<n>Claude Opus, GPT-4 from ChatGPT, and Copilot-in augmenting the penetration testing process as defined by the Penetration Testing Execution Standard (PTES)
Score: 44.99833362998488
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The advent of Generative Artificial Intelligence (GenAI) has brought a significant change to our society. GenAI can be applied across numerous fields, with particular relevance in cybersecurity. Among the various areas of application, its use in penetration testing (pentesting) or ethical hacking processes is of special interest. In this paper, we have analyzed the potential of leading generic-purpose GenAI tools-Claude Opus, GPT-4 from ChatGPT, and Copilot-in augmenting the penetration testing process as defined by the Penetration Testing Execution Standard (PTES). Our analysis involved evaluating each tool across all PTES phases within a controlled virtualized environment. The findings reveal that, while these tools cannot fully automate the pentesting process, they provide substantial support by enhancing efficiency and effectiveness in specific tasks. Notably, all tools demonstrated utility; however, Claude Opus consistently outperformed the others in our experimental scenarios.

Related papers

Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks.<n>Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use.<n>We propose a framework that encourages models to produce accurate answers with minimal tool calls.<n>Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
Computational Safety for Generative AI: A Signal Processing Perspective [65.268245109828]
computational safety is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI. We show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts. We discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
arXiv Detail & Related papers (2025-02-18T02:26:50Z)
VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework [4.802551205178858]
Existing large language model (LLM)-assisted or automated penetration testing approaches often suffer from inefficiencies. VulnBot decomposes complex tasks into three specialized phases: reconnaissance, scanning, and exploitation. Key design features include role specialization, penetration path planning, inter-agent communication, and generative penetration behavior.
arXiv Detail & Related papers (2025-01-23T06:33:05Z)
The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.<n>We conduct the first large-scale, multi-benchmark web agent experiment.<n>Results highlight a large discrepancy between OpenAI and Anthropic's latests models.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
AI-Augmented Ethical Hacking: A Practical Examination of Manual Exploitation and Privilege Escalation in Linux Environments [2.3020018305241337]
This study explores the application of generative AI (GenAI) within manual exploitation and privilege escalation tasks in Linux-based penetration testing environments. Our findings demonstrate that GenAI can streamline processes, such as identifying potential attack vectors and parsing complex outputs for sensitive data during privilege escalation.
arXiv Detail & Related papers (2024-11-26T15:55:15Z)
AI-Compass: A Comprehensive and Effective Multi-module Testing Tool for AI Systems [26.605694684145313]
In this study, we design and implement a testing tool, tool, to comprehensively and effectively evaluate AI systems. The tool extensively assesses adversarial robustness, model interpretability, and performs neuron analysis. Our research sheds light on a general solution for AI systems testing landscape.
arXiv Detail & Related papers (2024-11-09T11:15:17Z)
Disrupting Test Development with AI Assistants [1.024113475677323]
Generative AI-assisted coding tools like GitHub Copilot, ChatGPT, and Tabnine have significantly transformed software development. This paper analyzes how these innovations impact productivity and software test development metrics.
arXiv Detail & Related papers (2024-11-04T17:52:40Z)
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs. Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
Hacking, The Lazy Way: LLM Augmented Pentesting [0.0]
We introduce a new concept called "LLM Augmented Pentesting" demonstrated with a tool named "Pentest Copilot"<n>Our approach focuses on overcoming the traditional resistance to automation in penetration testing by employing LLMs to automate specific sub-tasks.<n>Pentest Copilot showcases remarkable proficiency in tasks such as utilizing testing tools, interpreting outputs, and suggesting follow-up actions.
arXiv Detail & Related papers (2024-09-14T17:40:35Z)
AI-powered test automation tools: A systematic review and empirical evaluation [1.3490988186255937]
We investigate the features provided by existing AI-based test automation tools. We empirically evaluate how the AI features can be helpful for effectiveness and efficiency of testing. We also study the limitations of the AI features in AI-based test tools.
arXiv Detail & Related papers (2024-08-31T10:10:45Z)
CIPHER: Cybersecurity Intelligent Penetration-testing Helper for Ethical Researcher [1.6652242654250329]
We develop CIPHER (Cybersecurity Intelligent Penetration-testing Helper for Ethical Researchers), a large language model specifically trained to assist in penetration testing tasks. We trained CIPHER using over 300 high-quality write-ups of vulnerable machines, hacking techniques, and documentation of open-source penetration testing tools. We introduce the Findings, Action, Reasoning, and Results (FARR) Flow augmentation, a novel method to augment penetration testing write-ups to establish a fully automated pentesting simulation benchmark.
arXiv Detail & Related papers (2024-08-21T14:24:04Z)
PyTrial: Machine Learning Software and Benchmark for Clinical Trial Applications [49.69824178329405]
PyTrial provides benchmarks and open-source implementations of a series of machine learning algorithms for clinical trial design and operations. We thoroughly investigate 34 ML algorithms for clinical trials across 6 different tasks, including patient outcome prediction, trial site selection, trial outcome prediction, patient-trial matching, trial similarity search, and synthetic data generation. PyTrial defines each task through a simple four-step process: data loading, model specification, model training, and model evaluation, all achievable with just a few lines of code.
arXiv Detail & Related papers (2023-06-06T21:19:03Z)
AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges [60.56413461109281]
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions.
arXiv Detail & Related papers (2023-04-10T15:38:12Z)
Realistic simulation of users for IT systems in cyber ranges [63.20765930558542]
We instrument each machine by means of an external agent to generate user activity. This agent combines both deterministic and deep learning based methods to adapt to different environment. We also propose conditional text generation models to facilitate the creation of conversations and documents.
arXiv Detail & Related papers (2021-11-23T10:53:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.