Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot
- URL: http://arxiv.org/abs/2501.06963v2
- Date: Tue, 26 Aug 2025 16:03:41 GMT
- Title: Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot
- Authors: Antonio López Martínez, Alejandro Cano, Antonio Ruiz-Martínez
- Abstract summary: GenAI can be applied across numerous fields, with particular relevance in cybersecurity. In this paper, we have analyzed the potential of leading general-purpose GenAI tools (Claude Opus, GPT-4 from ChatGPT, and Copilot) in augmenting the penetration testing process as defined by the Penetration Testing Execution Standard (PTES).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of Generative Artificial Intelligence (GenAI) has brought a significant change to our society. GenAI can be applied across numerous fields, with particular relevance in cybersecurity. Among the various areas of application, its use in penetration testing (pentesting) or ethical hacking processes is of special interest. In this paper, we have analyzed the potential of leading general-purpose GenAI tools (Claude Opus, GPT-4 from ChatGPT, and Copilot) in augmenting the penetration testing process as defined by the Penetration Testing Execution Standard (PTES). Our analysis involved evaluating each tool across all PTES phases within a controlled virtualized environment. The findings reveal that, while these tools cannot fully automate the pentesting process, they provide substantial support by enhancing efficiency and effectiveness in specific tasks. Notably, all tools demonstrated utility; however, Claude Opus consistently outperformed the others in our experimental scenarios.
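As a rough illustration of the per-phase evaluation the abstract describes, the scoring could be tallied as follows. This is a hypothetical sketch only: the phase names follow PTES, but the scoring scale and the example numbers are illustrative placeholders, not the paper's data.

```python
# Hypothetical sketch: averaging per-phase support scores for each GenAI tool.
# Phase list follows PTES; tools and scores here are illustrative, not the
# paper's actual results.
PTES_PHASES = [
    "pre-engagement", "intelligence gathering", "threat modeling",
    "vulnerability analysis", "exploitation", "post-exploitation", "reporting",
]

def summarize(scores: dict[str, dict[str, int]]) -> dict[str, float]:
    """Average each tool's per-phase score (0 = no help, 2 = strong help)."""
    return {tool: sum(by_phase.values()) / len(by_phase)
            for tool, by_phase in scores.items()}

example = {
    "Claude Opus": {p: 2 for p in PTES_PHASES},
    "GPT-4":       {p: 1 for p in PTES_PHASES},
}
print(summarize(example))  # {'Claude Opus': 2.0, 'GPT-4': 1.0}
```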
Related papers
- Impacts of Generative AI on Agile Teams' Productivity: A Multi-Case Longitudinal Study [5.9568322124195845]
Generative Artificial Intelligence (GenAI) tools represent a paradigm shift in software engineering. This study aims to provide a longitudinal evaluation of GenAI's impact on agile software teams.
arXiv Detail & Related papers (2026-02-14T13:26:16Z) - SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents [100.12367115920121]
We introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines. We also present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities.
arXiv Detail & Related papers (2026-02-13T14:58:18Z) - Adoption of Generative Artificial Intelligence in the German Software Engineering Industry: An Empirical Study [9.442926409509038]
Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry are rising, the underlying factors influencing the effective use of these tools have not been thoroughly investigated. This issue is particularly relevant in environments with stringent regulatory requirements, such as Germany. No empirical study has systematically examined the adoption dynamics of GenAI tools within the German context.
arXiv Detail & Related papers (2026-01-23T12:42:33Z) - OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents [49.34040731113563]
We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities. Rigorous manual validation yields 158 high-quality tools, each verified for correct functionality, practical applicability, and versatility. OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments.
arXiv Detail & Related papers (2025-10-28T15:56:36Z) - AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research. Our suite comes with the first scientific research environment with production-grade search tools. Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z) - DeepAgent: A General Reasoning Agent with Scalable Toolsets [111.6384541877723]
DeepAgent is an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution. To address the challenges of long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories. We develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens.
arXiv Detail & Related papers (2025-10-24T16:24:01Z) - Will AI also replace inspectors? Investigating the potential of generative AIs in usability inspection [0.0]
This study examines the performance of generative AIs in identifying usability problems, comparing their results to those of experienced human inspectors. While inspectors achieved the highest levels of precision and overall coverage, the AIs demonstrated high individual performance and discovered many novel defects, but with a higher rate of false positives and redundant reports. These findings suggest that AI, in its current stage, cannot replace human inspectors but can serve as a valuable augmentation tool to improve efficiency and expand defect coverage.
arXiv Detail & Related papers (2025-10-19T23:59:15Z) - Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks. Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. We propose a framework that encourages models to produce accurate answers with minimal tool calls. Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
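The idea of rewarding correct answers while discouraging unnecessary tool use can be sketched as a shaped reward. This is an illustrative formulation, not the paper's exact objective; the penalty coefficient is an assumed placeholder.

```python
# Illustrative reward shaping (not the paper's exact objective): reward
# correctness, then subtract a small penalty per external tool call so that
# a correct answer reached with fewer calls scores strictly higher.
def shaped_reward(correct: bool, n_tool_calls: int,
                  penalty: float = 0.05) -> float:
    base = 1.0 if correct else 0.0
    return base - penalty * n_tool_calls

# A correct answer with fewer calls is preferred:
assert shaped_reward(True, 1) > shaped_reward(True, 5)
```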
arXiv Detail & Related papers (2025-04-21T05:40:05Z) - General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems.
We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure.
Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z) - Computational Safety for Generative AI: A Signal Processing Perspective [65.268245109828]
Computational safety is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI.
We show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts.
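The sensitivity-analysis idea can be illustrated with a toy one-dimensional model: a prompt region where the loss changes sharply under small perturbations is flagged as suspicious. The stand-in loss function below is purely illustrative and not taken from the paper.

```python
# Toy illustration of sensitivity analysis: estimate how sharply a stand-in
# model loss changes under a small input perturbation (central difference).
import math

def loss(x: float) -> float:
    return math.tanh(3 * x) ** 2   # stand-in loss, not a real model

def sensitivity(x: float, eps: float = 1e-4) -> float:
    return abs(loss(x + eps) - loss(x - eps)) / (2 * eps)

# High sensitivity near the decision region, near zero far from it:
assert sensitivity(0.1) > sensitivity(5.0)
```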
We discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
arXiv Detail & Related papers (2025-02-18T02:26:50Z) - VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework [4.802551205178858]
Existing large language model (LLM)-assisted or automated penetration testing approaches often suffer from inefficiencies.
VulnBot decomposes complex tasks into three specialized phases: reconnaissance, scanning, and exploitation.
Key design features include role specialization, penetration path planning, inter-agent communication, and generative penetration behavior.
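The three-phase decomposition the abstract describes (reconnaissance, then scanning, then exploitation) can be sketched as a simple state-passing pipeline. All function bodies and names below are hypothetical placeholders, not VulnBot's implementation.

```python
# Hedged sketch of a three-phase pentest pipeline (reconnaissance ->
# scanning -> exploitation). Phase bodies are placeholders standing in for
# real agent actions; nothing here touches a network.
from dataclasses import dataclass, field

@dataclass
class PentestState:
    target: str
    hosts: list[str] = field(default_factory=list)
    open_ports: dict[str, list[int]] = field(default_factory=dict)
    findings: list[str] = field(default_factory=list)

def reconnaissance(state: PentestState) -> PentestState:
    state.hosts.append(state.target)                       # placeholder discovery
    return state

def scanning(state: PentestState) -> PentestState:
    state.open_ports = {h: [22, 80] for h in state.hosts}  # placeholder scan
    return state

def exploitation(state: PentestState) -> PentestState:
    for host, ports in state.open_ports.items():
        state.findings.append(f"{host}: candidate services on ports {ports}")
    return state

state = PentestState("10.0.0.5")
for phase in (reconnaissance, scanning, exploitation):
    state = phase(state)
print(state.findings)
```

Passing a single state object between phases mirrors the inter-agent communication the abstract mentions: each phase consumes the previous phase's results.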
arXiv Detail & Related papers (2025-01-23T06:33:05Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We conduct the first large-scale, multi-benchmark web agent experiment. Results highlight a large discrepancy between OpenAI's and Anthropic's latest models.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - AI-Augmented Ethical Hacking: A Practical Examination of Manual Exploitation and Privilege Escalation in Linux Environments [2.3020018305241337]
This study explores the application of generative AI (GenAI) within manual exploitation and privilege escalation tasks in Linux-based penetration testing environments.
Our findings demonstrate that GenAI can streamline processes, such as identifying potential attack vectors and parsing complex outputs for sensitive data during privilege escalation.
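"Parsing complex outputs for sensitive data" can be illustrated with a small scanner over captured command output. The regex patterns below are illustrative examples of credential-like strings, not the study's actual method.

```python
# Minimal sketch of scanning captured command output for credential-like
# lines during privilege escalation. Patterns are illustrative only.
import re

PATTERNS = [
    re.compile(r"(?i)password\s*[:=]\s*(\S+)"),
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*(\S+)"),
]

def find_secrets(output: str) -> list[str]:
    hits = []
    for line in output.splitlines():
        for pat in PATTERNS:
            m = pat.search(line)
            if m:
                hits.append(m.group(1))
    return hits

sample = "user=svc\npassword: hunter2\napi_key = ABC123\n"
print(find_secrets(sample))  # ['hunter2', 'ABC123']
```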
arXiv Detail & Related papers (2024-11-26T15:55:15Z) - AI-Compass: A Comprehensive and Effective Multi-module Testing Tool for AI Systems [26.605694684145313]
In this study, we design and implement a testing tool, AI-Compass, to comprehensively and effectively evaluate AI systems.
The tool extensively assesses adversarial robustness, model interpretability, and performs neuron analysis.
Our research sheds light on a general solution for the AI systems testing landscape.
arXiv Detail & Related papers (2024-11-09T11:15:17Z) - Disrupting Test Development with AI Assistants [1.024113475677323]
Generative AI-assisted coding tools like GitHub Copilot, ChatGPT, and Tabnine have significantly transformed software development.
This paper analyzes how these innovations impact productivity and software test development metrics.
arXiv Detail & Related papers (2024-11-04T17:52:40Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - Hacking, The Lazy Way: LLM Augmented Pentesting [0.0]
We introduce a new concept called "LLM Augmented Pentesting", demonstrated with a tool named "Pentest Copilot". Our approach focuses on overcoming the traditional resistance to automation in penetration testing by employing LLMs to automate specific sub-tasks. Pentest Copilot showcases remarkable proficiency in tasks such as utilizing testing tools, interpreting outputs, and suggesting follow-up actions.
arXiv Detail & Related papers (2024-09-14T17:40:35Z) - AI-powered test automation tools: A systematic review and empirical evaluation [1.3490988186255937]
We investigate the features provided by existing AI-based test automation tools.
We empirically evaluate how the AI features can be helpful for effectiveness and efficiency of testing.
We also study the limitations of the AI features in AI-based test tools.
arXiv Detail & Related papers (2024-08-31T10:10:45Z) - CIPHER: Cybersecurity Intelligent Penetration-testing Helper for Ethical Researcher [1.6652242654250329]
We develop CIPHER (Cybersecurity Intelligent Penetration-testing Helper for Ethical Researchers), a large language model specifically trained to assist in penetration testing tasks.
We trained CIPHER using over 300 high-quality write-ups of vulnerable machines, hacking techniques, and documentation of open-source penetration testing tools.
We introduce the Findings, Action, Reasoning, and Results (FARR) Flow augmentation, a novel method to augment penetration testing write-ups to establish a fully automated pentesting simulation benchmark.
arXiv Detail & Related papers (2024-08-21T14:24:04Z) - PyTrial: Machine Learning Software and Benchmark for Clinical Trial Applications [49.69824178329405]
PyTrial provides benchmarks and open-source implementations of a series of machine learning algorithms for clinical trial design and operations.
We thoroughly investigate 34 ML algorithms for clinical trials across 6 different tasks, including patient outcome prediction, trial site selection, trial outcome prediction, patient-trial matching, trial similarity search, and synthetic data generation.
PyTrial defines each task through a simple four-step process: data loading, model specification, model training, and model evaluation, all achievable with just a few lines of code.
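The four-step process described above (data loading, model specification, model training, model evaluation) can be sketched in a few lines. The class and data below are hypothetical stand-ins, not PyTrial's real API or datasets.

```python
# Illustrative four-step workflow mirroring the abstract's description;
# the class, data, and logic here are toy placeholders, not PyTrial's API.
class TrialOutcomeModel:
    def fit(self, X, y):
        # toy "training": remember the majority label
        self.majority = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority for _ in X]

# 1. data loading (toy data in place of a real clinical trial dataset)
X, y = [[0.1], [0.4], [0.9]], [0, 1, 1]
# 2. model specification
model = TrialOutcomeModel()
# 3. model training
model.fit(X, y)
# 4. model evaluation
accuracy = sum(p == t for p, t in zip(model.predict(X), y)) / len(y)
print(accuracy)  # 2/3
```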
arXiv Detail & Related papers (2023-06-06T21:19:03Z) - AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges [60.56413461109281]
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes.
We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful.
We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions.
arXiv Detail & Related papers (2023-04-10T15:38:12Z) - Realistic simulation of users for IT systems in cyber ranges [63.20765930558542]
We instrument each machine by means of an external agent to generate user activity.
This agent combines both deterministic and deep learning based methods to adapt to different environments.
We also propose conditional text generation models to facilitate the creation of conversations and documents.
arXiv Detail & Related papers (2021-11-23T10:53:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences of its use.