Fugu-MT Paper Translation (Summary): Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing

Paper summary: Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing

arxiv url: http://arxiv.org/abs/2509.09207v1
Date: Thu, 11 Sep 2025 07:30:44 GMT
Status: Translation complete
Last updated in system: 2025-09-12 16:52:24.268642
Title: Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing
Authors: Wuyuao Mai, Geng Hong, Qi Liu, Jinsong Chen, Jiarun Dai, Xudong Pan, Yuan Zhang, Min Yang
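Abstract summary: This paper introduces TermiBench, the first real-world, agent-oriented pentesting benchmark, and proposes TermiAgent, a multi-agent penetration testing framework. In the evaluations, TermiAgent outperforms state-of-the-art agents and exhibits stronger penetration testing capability.
Reference score (independently computed attention score): 23.554239007767276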
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Penetration testing is critical for identifying and mitigating security vulnerabilities, yet traditional approaches remain expensive, time-consuming, and dependent on expert human labor. Recent work has explored AI-driven pentesting agents, but their evaluation relies on oversimplified capture-the-flag (CTF) settings that embed prior knowledge and reduce complexity, leading to performance estimates far from real-world practice. We close this gap by introducing the first real-world, agent-oriented pentesting benchmark, TermiBench, which shifts the goal from 'flag finding' to achieving full system control. The benchmark spans 510 hosts across 25 services and 30 CVEs, with realistic environments that require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution. Using this benchmark, we find that existing systems can hardly obtain system shells under realistic conditions. To address these challenges, we propose TermiAgent, a multi-agent penetration testing framework. TermiAgent mitigates long-context forgetting with a Located Memory Activation mechanism and builds a reliable exploit arsenal via structured code understanding rather than naive retrieval. In evaluations, our work outperforms state-of-the-art agents, exhibiting stronger penetration testing capability, reducing execution time and financial cost, and demonstrating practicality even on laptop-scale deployments. Our work delivers both the first open-source benchmark for real-world autonomous pentesting and a novel agent framework that establishes a milestone for AI-driven penetration testing.

Related papers