Fugu-MT Paper Translation (Abstract): AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Paper overview: AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

arxiv url: http://arxiv.org/abs/2606.14295v2
Date: Tue, 16 Jun 2026 04:20:26 GMT
Status: Translation complete
Last updated in system: 2026-06-17 15:01:46.637094
Title: AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges
Authors: Fengyu Liu, Jiarun Dai, Yihe Fan, Wuyuao Mai, Ziao Li, Bofei Chen, Jie Zhang, Zheng Lou, Bocheng Xiang, Qiyi Zhang, Xudong Pan, Geng Hong, Yuan Zhang, Min Yang
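Abstract summary: We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex solves 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks.
Reference score (attention score computed independently by this site): 20.164879773235594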
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier AI systems are increasingly capable of cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. This gap makes it difficult to observe emerging risks early, because frontier AI systems are rarely evaluated under realistic attack conditions. We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post-exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%. We also observe out-of-benchmark findings, including unknown vulnerabilities in popular projects and payload mutation that bypasses host defenses. These results show that open cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions.

Related papers