Fugu-MT Paper Translation (Abstract): VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities

Paper overview: VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities

arxiv url: http://arxiv.org/abs/2509.03331v1
Date: Wed, 03 Sep 2025 14:06:10 GMT
Status: Translation complete
Last updated in system: 2025-09-04 21:40:46.541762
Title: VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities
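Authors: Weizhe Wang, Wei Ma, Qiang Hu, Yao Zhang, Jianfei Sun, Bin Wu, Yang Liu, Guangquan Xu, Lingxiao Jiang
Abstract summary: VulnRepairEval is an evaluation framework anchored in functional Proof-of-Concept exploits. The framework provides a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment.
Reference score (independently computed attention metric): 41.85494398578654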
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The adoption of Large Language Models (LLMs) for automated software vulnerability patching has shown promising outcomes on carefully curated evaluation sets. Nevertheless, existing datasets predominantly rely on superficial validation methods rather than exploit-based verification, leading to overestimated performance in security-sensitive applications. This paper introduces VulnRepairEval, an evaluation framework anchored in functional Proof-of-Concept (PoC) exploits. Our framework delivers a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment, where repair success requires the original exploit to fail execution against the modified code. The benchmark construction involved extensive data curation: we processed over 400 CVEs and approximately 2,500 potential sources to extract a collection of authentic vulnerability instances (23 Python CVEs) amenable to automated testing with working PoCs. Through VulnRepairEval, we conduct a comprehensive evaluation of 12 popular LLMs and observe a significant performance deficit: even the top-performing model successfully addresses merely 5/23 instances (about 21.7%), exposing critical weaknesses in security-focused applications. Our failure analysis reveals that most unsuccessful attempts stem from imprecise vulnerability identification and patches containing syntactic or semantic errors. Enhanced prompting strategies and multi-agent approaches yield minimal improvements, with overall effectiveness remaining largely unaffected. This work contributes a stringent, practical evaluation framework for LLM-driven vulnerability remediation and underscores the necessity for assessment protocols that authentically reflect real-world exploitation scenarios.
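The differential assessment described in the abstract (a repair counts only if the original exploit fails against the modified code) can be illustrated with a minimal sketch. Everything below is hypothetical and is not taken from the paper's pipeline: the toy path-traversal functions, `poc_exploit`, and `differential_check` are illustrative names, and the extra requirement that the exploit must first succeed on the original code is an assumed sanity check. The real framework runs PoCs inside containers against actual CVE instances.

```python
import posixpath
from dataclasses import dataclass
from typing import Callable

BASE = "/srv/data"  # hypothetical base directory for the toy example


def resolve_vulnerable(name: str) -> str:
    """Toy vulnerable code: joins user input without a containment check."""
    return posixpath.normpath(posixpath.join(BASE, name))


def resolve_patched(name: str) -> str:
    """Toy candidate patch: rejects paths that escape BASE."""
    path = posixpath.normpath(posixpath.join(BASE, name))
    if not path.startswith(BASE + "/"):
        raise ValueError("path escapes base directory")
    return path


def poc_exploit(resolve: Callable[[str], str]) -> bool:
    """Toy PoC: returns True if the traversal payload reaches /etc/passwd."""
    try:
        return resolve("../../etc/passwd") == "/etc/passwd"
    except ValueError:
        return False


@dataclass
class RepairResult:
    exploit_works_on_original: bool
    exploit_works_on_patched: bool

    @property
    def repaired(self) -> bool:
        # Assumed criterion: the PoC must succeed before the patch
        # (sanity check) and fail after it.
        return self.exploit_works_on_original and not self.exploit_works_on_patched


def differential_check(poc, original, patched) -> RepairResult:
    """Run the same PoC against the original and the patched code."""
    return RepairResult(poc(original), poc(patched))


result = differential_check(poc_exploit, resolve_vulnerable, resolve_patched)
noop = differential_check(poc_exploit, resolve_vulnerable, resolve_vulnerable)

# Success rate reported in the abstract for the best model: 5 of 23 instances.
best_rate = round(5 / 23 * 100, 1)
```

Here `result.repaired` is true because the payload no longer resolves outside the base directory, while `noop.repaired` is false because an unchanged "patch" leaves the exploit working. The last line reproduces the arithmetic behind the abstract's "about 21.7%".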


Related papers