PentestGPT: An LLM-empowered Automatic Penetration Testing Tool
- URL: http://arxiv.org/abs/2308.06782v2
- Date: Sun, 2 Jun 2024 16:16:49 GMT
- Title: PentestGPT: An LLM-empowered Automatic Penetration Testing Tool
- Authors: Gelei Deng, Yi Liu, VĂctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, Stefan Rass,
- Abstract summary: Large Language Models (LLMs) have shown significant advancements in various domains.
We evaluate the performance of LLMs on real-world penetration testing tasks using a robust benchmark created from test machines with platforms.
We introduce PentestGPT, an LLM-empowered automatic penetration testing tool.
- Score: 20.449761406790415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Penetration testing, a crucial industrial practice for ensuring system security, has traditionally resisted automation due to the extensive expertise required by human professionals. Large Language Models (LLMs) have shown significant advancements in various domains, and their emergent abilities suggest their potential to revolutionize industries. In this research, we evaluate the performance of LLMs on real-world penetration testing tasks using a robust benchmark created from test machines with platforms. Our findings reveal that while LLMs demonstrate proficiency in specific sub-tasks within the penetration testing process, such as using testing tools, interpreting outputs, and proposing subsequent actions, they also encounter difficulties maintaining an integrated understanding of the overall testing scenario. In response to these insights, we introduce PentestGPT, an LLM-empowered automatic penetration testing tool that leverages the abundant domain knowledge inherent in LLMs. PentestGPT is meticulously designed with three self-interacting modules, each addressing individual sub-tasks of penetration testing, to mitigate the challenges related to context loss. Our evaluation shows that PentestGPT not only outperforms LLMs with a task-completion increase of 228.6\% compared to the \gptthree model among the benchmark targets but also proves effective in tackling real-world penetration testing challenges. Having been open-sourced on GitHub, PentestGPT has garnered over 4,700 stars and fostered active community engagement, attesting to its value and impact in both the academic and industrial spheres.
Related papers
- Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.056044348209483]
Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints.
Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
arXiv Detail & Related papers (2024-06-28T20:38:41Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [85.51252685938564]
Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML)
As with other ML models, large language models (LLMs) are prone to make incorrect predictions, hallucinate'' by fabricating claims, or simply generate low-quality output for a given input.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z) - RepEval: Effective Text Evaluation with LLM Representation [54.07909112633993]
We introduce RepEval, the first metric leveraging the projection of LLM representations for evaluation.
RepEval requires minimal sample pairs for training, and through simple prompt modifications, it can easily transition to various tasks.
Results on ten datasets from three tasks demonstrate the high effectiveness of our method.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis [8.31978033489419]
We propose TELPA, a novel technique to generate tests that can reach hard-to-cover branches.
Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques.
arXiv Detail & Related papers (2024-04-07T14:08:28Z) - LLbezpeky: Leveraging Large Language Models for Vulnerability Detection [10.330063887545398]
Large Language Models (LLMs) have shown tremendous potential in understanding semnatics in human as well as programming languages.
We focus on building an AI-driven workflow to assist developers in identifying and rectifying vulnerabilities.
arXiv Detail & Related papers (2024-01-02T16:14:30Z) - LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks [2.7719338074999538]
Penetration testing allows organizations to proactively identify and remediate vulnerabilities in their systems.
One recent advancement in the realm of penetration testing is the utilization of Language Models (LLMs)
We explore the intersection of LLMs and penetration testing to gain insight into their capabilities and challenges in the context of privilege escalation.
arXiv Detail & Related papers (2023-10-17T17:15:41Z) - Identifying the Risks of LM Agents with an LM-Emulated Sandbox [68.26587052548287]
Language Model (LM) agents and tools enable a rich set of capabilities but also amplify potential risks.
High cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks.
We introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios.
arXiv Detail & Related papers (2023-09-25T17:08:02Z) - Getting pwn'd by AI: Penetration Testing with Large Language Models [0.0]
This paper explores the potential usage of large-language models, such as GPT3.5, to augment penetration testers with AI sparring partners.
We explore the feasibility of supplementing penetration testers with AI models for two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine.
arXiv Detail & Related papers (2023-07-24T19:59:22Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting
Pot [71.28884625011987]
Melting Pot is a MARL evaluation suite that uses reinforcement learning to reduce the human labor required to create novel test scenarios.
We have created over 80 unique test scenarios covering a broad range of research topics.
We apply these test scenarios to standard MARL training algorithms, and demonstrate how Melting Pot reveals weaknesses not apparent from training performance alone.
arXiv Detail & Related papers (2021-07-14T17:22:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.