Fugu-MT Paper Translation (Abstract): A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Paper summary: A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

arxiv url: http://arxiv.org/abs/2508.18106v1
Date: Mon, 25 Aug 2025 15:11:11 GMT
Status: Translation complete
System update date: 2025-08-26 18:43:45.833285
Title: A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
Title (reference translation): A.S.E: A Repository-Level Benchmark for Security Evaluation of AI-Generated Code
Authors: Keke Lian, Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li, Dong Zhang
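Abstract summary: A.S.E (AI Code Generation Security Evaluation) is a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability.
Reference score (independently computed attention score): 48.10068691540979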
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context like build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.
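Abstract (reference translation): The growing adoption of large language models (LLMs) in software engineering calls for rigorous security evaluation of the code they generate. However, existing benchmarks are inadequate: they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of the input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context such as build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. The evaluation of leading LLMs on A.S.E yields three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow, and Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise "fast-thinking" decoding strategies consistently outperform complex "slow-thinking" reasoning for security patching.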

Related papers