HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing
- URL: http://arxiv.org/abs/2412.01778v1
- Date: Mon, 02 Dec 2024 18:28:18 GMT
- Title: HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing
- Authors: Lajos Muzsai, David Imolai, András Lukács
- Abstract summary: We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing.
To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets utilizing the popular platforms PicoCTF and OverTheWire.
- Abstract: We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing. HackSynth's dual-module architecture includes a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets utilizing the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulties, providing a standardized framework for evaluating LLM-based penetration testing agents. Based on these benchmarks, extensive experiments are presented, analyzing the core parameters of HackSynth, including creativity (temperature and top-p) and token utilization. Multiple open-source and proprietary LLMs were used to measure the agent's capabilities. The experiments show that the agent performed best with the GPT-4o model, better than GPT-4o's system card suggests. We also discuss the safety and predictability of HackSynth's actions. Our findings indicate the potential of LLM-based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.
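The dual-module loop described in the abstract (a Planner that proposes shell commands and a Summarizer that condenses the resulting output into a compact state for the next step) can be sketched as below. This is an illustrative sketch only, not the authors' released code: the client class, prompts, and flag check are assumptions made for this example, and the temperature/top-p arguments are exposed only to mirror the "creativity" parameters the paper analyzes.

```python
# Illustrative sketch of a Planner/Summarizer agent loop; the real HackSynth
# implementation is in the authors' public repository. The LLM client, prompts,
# and flag check below are assumptions made for this example.
import subprocess


class ChatLLM:
    """Placeholder for any chat-completion client (hypothetical interface)."""

    def __init__(self, model: str, temperature: float = 0.7, top_p: float = 0.9):
        # temperature and top_p correspond to the "creativity" parameters the paper analyzes.
        self.model, self.temperature, self.top_p = model, temperature, top_p

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire this up to a concrete LLM API")


def solve_challenge(llm: ChatLLM, objective: str, max_steps: int = 10) -> str | None:
    """Iterate Planner -> execute -> Summarizer until a flag is found or steps run out."""
    summary = "No actions taken yet."
    for _ in range(max_steps):
        # Planner: propose the next shell command from the objective and current summary.
        command = llm.complete(
            f"Objective: {objective}\nKnown so far: {summary}\n"
            "Reply with exactly one shell command to run next."
        ).strip()

        # Execute the command; in a realistic setup this would run inside an
        # isolated container so the agent cannot affect the host.
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        output = (result.stdout + result.stderr)[:4000]  # cap feedback to limit token use

        # Simplified success check: stop when a CTF-style flag appears in the output.
        if "picoCTF{" in output or "FLAG{" in output:
            return output

        # Summarizer: fold the new observation into a compact state for the next step.
        summary = llm.complete(
            f"Previous summary: {summary}\nCommand run: {command}\nOutput: {output}\n"
            "Summarize the relevant findings in a few sentences."
        )
    return None
```

A caller would replace ChatLLM with a concrete API client and point solve_challenge at a sandboxed CTF target; bounding the command output before summarization is one simple way to manage the token utilization the paper measures.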
Related papers
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.
This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.
Our resulting model, LibraChem, outperforms leading LLMs including Claude-3, GPT-4o, and LLaMA-3 by margins of 13.44%, 7.16%, and 7.10% respectively.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.
We conduct the first large-scale, multi-benchmark web agent experiment.
Results highlight a large discrepancy between OpenAI's and Anthropic's latest models.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an LLM-driven automated penetration testing agent based on the PSM principle.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins [8.244444633880603]
We introduce SimBench, a benchmark designed to evaluate the proficiency of student large language models (S-LLMs) in generating digital twins (DTs).
Given a collection of S-LLMs, this benchmark enables the ranking of the S-LLMs based on their ability to produce high-quality DTs.
arXiv Detail & Related papers (2024-08-21T20:52:32Z) - NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security [28.125179435861316]
Large Language Models (LLMs) are being deployed across various domains, but their capacity to solve Capture the Flag (CTF) challenges has not been thoroughly evaluated.
We develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database.
This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions.
arXiv Detail & Related papers (2024-06-08T22:21:42Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - A Preliminary Study on Using Large Language Models in Software Pentesting [2.0551676463612636]
Large language models (LLMs) are perceived to offer promising potential for automating security tasks.
We investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code.
arXiv Detail & Related papers (2024-01-30T21:42:59Z) - Selene: Pioneering Automated Proof in Software Verification [62.09555413263788]
We introduce Selene, the first project-level automated proof benchmark, constructed on the real-world, industrial-grade seL4 operating system microkernel.
Our experimental results with advanced large language models (LLMs), such as GPT-3.5-turbo and GPT-4, highlight the capabilities of LLMs in the domain of automated proof generation.
arXiv Detail & Related papers (2024-01-15T13:08:38Z) - Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks [2.0315147707806283]
Mystique is an accurate and scalable framework for production AI benchmark generation.
Mystique is scalable thanks to its lightweight data collection, which keeps runtime overhead and instrumentation effort low.
We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models.
arXiv Detail & Related papers (2022-12-16T18:46:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.