Fugu-MT Paper Translation (Abstract): Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

Paper summary: Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

arxiv url: http://arxiv.org/abs/2605.23243v1
Date: Fri, 22 May 2026 05:24:43 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.209382
Title: Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
Title (reference translation): Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
Authors: Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri
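Abstract summary: A dual-mode benchmark is used to evaluate whether frontier LLMs are ready for cybersecurity. We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms.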
Reference score (independently computed attention metric): 0.3303672705634661
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
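Abstract (reference translation): The dual-mode benchmark consists of white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. The results show that (1) every frontier model produces false positive rates of 10-50% in white-box detection; (2) in black-box testing, frontier models reach only 4-8% ground-truth coverage, rising to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, indicating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and the lowest false positive rate (9.7%) on a single GPU. The absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) is identified as the fundamental training data bottleneck, and self-play security testing is proposed as a data generation strategy. These results make the case for vertical foundation models purpose-built for cybersecurity.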
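
The abstract reports precision, false positive rate, and ground-truth coverage without defining them. The sketch below shows the standard definitions one would most plausibly assume. The class and function names, the confusion counts, and the vulnerability IDs are all hypothetical and are not taken from the paper; only the total of 118 ground-truth vulnerabilities comes from the abstract. Whether the authors pool coverage across all five applications is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Binary detection outcomes. All counts used below are hypothetical."""
    tp: int  # vulnerable samples flagged as vulnerable
    fp: int  # safe samples flagged as vulnerable
    tn: int  # safe samples left unflagged
    fn: int  # vulnerable samples left unflagged


def precision(c: ConfusionCounts) -> float:
    """Fraction of flagged samples that are truly vulnerable."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    """Fraction of safe samples that were wrongly flagged."""
    safe = c.fp + c.tn
    return c.fp / safe if safe else 0.0


def ground_truth_coverage(found_ids, ground_truth_ids) -> float:
    """Fraction of known vulnerabilities that appear among the findings.

    Findings that match no ground-truth entry are ignored here
    (an assumption; the paper may score them differently).
    """
    truth = set(ground_truth_ids)
    return len(truth & set(found_ids)) / len(truth) if truth else 0.0


if __name__ == "__main__":
    # Hypothetical white-box run: 100 vulnerable and 100 safe functions.
    counts = ConfusionCounts(tp=90, fp=10, tn=90, fn=10)
    print(f"precision={precision(counts):.3f}")
    print(f"false positive rate={false_positive_rate(counts):.3f}")

    # 118 ground-truth vulnerabilities (from the abstract), hypothetical IDs.
    truth = [f"V{i}" for i in range(118)]
    found = truth[:7] + ["unmatched-finding"]  # hypothetical agent output
    print(f"coverage={ground_truth_coverage(found, truth):.3f}")
```

Under the pooled-coverage assumption, the reported 4-8% range corresponds to roughly 5 to 9 of the 118 ground-truth vulnerabilities, and the 10-19% range to roughly 12 to 22.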

Related papers