Fugu-MT Paper Translation (Abstract): Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

Paper overview: Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

arxiv url: http://arxiv.org/abs/2604.17159v1
Date: Sat, 18 Apr 2026 22:13:23 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.365132
Title: Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks
Title (reference translation): Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks
Authors: Tyler H. Merves, Michael H. Conaway, Joseph M. Escobar, Hakan T. Otal, Unal Tatar
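Abstract summary: We evaluate 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost-efficiency at $0.05 per solve.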
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Building on the D-CIPHER multi-agent framework, we extend it with multi-provider backend support, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu, while auto-prompting and category-specific tips often degrade performance in well-equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost-efficiency at $0.05 per solve. Asymmetric planner/executor model assignments provide no meaningful benefit while coherent same-model configurations consistently outperform mixed-tier pairings. Our results indicate that environment tooling and model selection emerge as the strongest drivers of performance, whereas prompt engineering interventions show diminishing or negative returns in well-equipped environments. Reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration.
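Abstract (reference translation): To our knowledge, we carry out the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. The work builds on the D-CIPHER multi-agent framework and extends it with multi-provider backend support, a custom Kali Linux environment with more than 100 pre-installed penetration testing tools, and runtime tool-discovery agents. A controlled factorial study shows that the Kali Linux environment improves results by +9.5 percentage points over Ubuntu, while auto-prompting and category-specific tips often degrade performance in well-equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), and Gemini 3 Flash offers the best cost-efficiency at $0.05 per solve. Asymmetric planner/executor model assignments provide no meaningful benefit, whereas coherent same-model configurations consistently outperform mixed-tier pairings. These results suggest that environment tooling and model selection are the strongest drivers of performance, while prompt engineering interventions show diminishing or negative returns in well-equipped environments. Reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration.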

Related papers