Fugu-MT Paper Translation (Abstract): SecCodeBench-V2 Technical Report

Paper overview: SecCodeBench-V2 Technical Report

arxiv url: http://arxiv.org/abs/2602.15485v1
Date: Tue, 17 Feb 2026 10:47:06 GMT
Status: Translation complete
Last updated in system: 2026-02-18 16:03:18.036119
Title: SecCodeBench-V2 Technical Report
Authors: Longfei Chen, Ji Zhao, Lanxiao Cui, Tong Su, Xingbo Pan, Ziyang Li, Yongxing Wu, Qijiang Cao, Qiyao Cai, Jing Zhang, Yuandong Ni, Junyao He, Zeyu Zhang, Chao Ge, Xuhuai Lu, Zeyu Gao, Yuxin Cui, Weisen Chen, Yuxuan Peng, Shengping Wang, Qi Li, Yukai Huang, Yukun Liu, Tuo Zhou, Terry Yue Zhuo, Junyang Lin, Chao Zhang
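Abstract summary: SecCodeBench-V2 is a benchmark for evaluating the ability of Large Language Model (LLM) copilots to generate secure code. It comprises 98 generation and fix scenarios derived from Alibaba Group's industrial production. For each scenario, it provides executable proof-of-concept (PoC) test cases for both functional validation and security verification.
Reference score (attention score computed independently by this site): 43.10947096543533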
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce SecCodeBench-V2, a publicly released benchmark for evaluating the ability of Large Language Model (LLM) copilots to generate secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial production, where the underlying security issues span 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and Node.js. SecCodeBench-V2 adopts a function-level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double-reviewed by security experts, ensuring high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model-generated artifacts in isolated environments and execute PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM-as-a-judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K-based scoring protocol with principled aggregation over scenarios and severity, enabling holistic and comparable evaluation across models. Overall, SecCodeBench-V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at https://alibaba.github.io/sec-code-bench. The benchmark is publicly available at https://github.com/alibaba/sec-code-bench.
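The abstract mentions a Pass@K-based scoring protocol aggregated over scenarios and severity but does not give the formula. The following is a minimal sketch of what such a protocol might look like, not the paper's actual method. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and a weighted mean over scenarios. The names `pass_at_k`, `aggregate_score`, and `SEVERITY_WEIGHTS`, the severity labels, and the weight values are all hypothetical. It also assumes a sample counts as passing only when it succeeds on both the functional and the security test cases.

```python
from math import comb

# Hypothetical severity weights; the real benchmark's weights are not
# stated in the abstract.
SEVERITY_WEIGHTS = {"high": 3.0, "medium": 2.0, "low": 1.0}


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k samples drawn without replacement from the n is passing.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def aggregate_score(results, k: int) -> float:
    """Severity-weighted mean of per-scenario pass@k.

    `results` is a list of (severity, n_samples, n_passing) tuples,
    one tuple per scenario.
    """
    total_weight = sum(SEVERITY_WEIGHTS[sev] for sev, _, _ in results)
    weighted_sum = sum(
        SEVERITY_WEIGHTS[sev] * pass_at_k(n, c, k) for sev, n, c in results
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Two illustrative scenarios with made-up counts.
    demo = [("high", 10, 10), ("low", 10, 0)]
    print(aggregate_score(demo, k=1))
```

Under this weighting, a model that fails a high-severity scenario loses more score than one that fails a low-severity scenario, which is one plausible reading of "aggregation over scenarios and severity".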
