Fugu-MT Paper Translation (Abstract): RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

Paper summary: RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

arxiv url: http://arxiv.org/abs/2601.22706v1
Date: Fri, 30 Jan 2026 08:29:01 GMT
Status: Translation complete
System update date: 2026-02-02 18:28:15.322553
Title: RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
Title (reference translation): RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
Authors: Yanlin Wang, Ziyao Zhang, Chong Wang, Xinyi Xu, Mingwei Liu, Yong Wang, Jiachi Chen, Zibin Zheng
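Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their ability to produce secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation carefully constructed from real-world, high-risk Java repositories.
Reference score (independently computed attention score): 58.32028251925354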
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic vulnerabilities or evaluating functional correctness in isolation, failing to capture the complex interplay between functionality and security found in real-world software. To address this gap, we introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories. Our methodology employs a multi-stage pipeline that combines systematic SAST scanning with CodeQL, LLM-based false positive elimination, and rigorous human expert validation. The resulting benchmark contains 105 instances grounded in real-world repository contexts, spanning 19 Common Weakness Enumeration (CWE) types and exhibiting a wide diversity of data flow complexities, including vulnerabilities with up to 34-hop inter-procedural dependencies. Using RealSec-bench, we conduct an extensive empirical study on 5 popular LLMs. We introduce a novel composite metric, SecurePass@K, to assess both functional correctness and security simultaneously. We find that while Retrieval-Augmented Generation (RAG) techniques can improve functional correctness, they provide negligible benefits to security. Furthermore, explicitly prompting models with general security guidelines often leads to compilation failures, harming functional correctness without reliably preventing vulnerabilities. Our work highlights the gap between functional and secure code generation in current LLMs.
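Abstract (reference translation): Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their ability to produce secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic vulnerabilities or evaluating functional correctness in isolation, failing to capture the complex interplay between functionality and security found in real-world software. To address this gap, we introduce RealSec-bench, a new benchmark for secure code generation carefully constructed from real-world, high-risk Java repositories. The methodology uses a multi-stage pipeline that combines systematic SAST scanning with CodeQL, LLM-based false positive elimination, and rigorous human expert validation. The resulting benchmark contains 105 instances spanning 19 Common Weakness Enumeration (CWE) types. Using RealSec-bench, we conduct an extensive empirical study of 5 LLMs. We introduce a new composite metric, SecurePass@K, to assess functional correctness and security simultaneously. Retrieval-Augmented Generation (RAG) techniques improve functional correctness but provide negligible benefits to security. Furthermore, explicitly prompting models with general security guidelines often leads to compilation failures, harming functional correctness without reliably preventing vulnerabilities. Our work highlights the gap between functional and secure code generation in current LLMs.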
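
The abstract names the SecurePass@K metric but does not define it. As a rough illustration only, the sketch below assumes it follows the widely used unbiased pass@k estimator, with a generated sample counted as a success only when it both passes the functional tests and is free of the target vulnerability. The function names `pass_at_k` and `secure_pass_at_k` and the `(functional_ok, secure_ok)` input format are hypothetical; the paper's actual definition may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n generations, is among the c successes."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def secure_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """ASSUMED composite metric: a sample counts only if it is both
    functionally correct and secure. Each sample is (functional_ok, secure_ok)."""
    successes = sum(1 for functional_ok, secure_ok in samples
                    if functional_ok and secure_ok)
    return pass_at_k(len(samples), successes, k)


if __name__ == "__main__":
    # Four hypothetical generations for one benchmark instance.
    generations = [(True, True), (True, False), (False, True), (False, False)]
    for k in (1, 2, 4):
        print(k, secure_pass_at_k(generations, k))
```

Under this assumption, a model that often produces working but vulnerable code would score well on plain pass@k yet poorly on the composite metric, which is the kind of gap the abstract describes.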

Related papers