Fugu-MT Paper Translation (Abstract): From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection

Paper overview: From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection

arxiv url: http://arxiv.org/abs/2512.10485v1
Date: Thu, 11 Dec 2025 10:04:54 GMT
Status: Translation complete
Last updated in system: 2025-12-12 16:15:42.312199
Title: From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection
Title (reference translation): From the Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection
Authors: Chaomeng Lu, Bert Lagaisse
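Abstract summary: Vulnerability detection methods based on deep learning (DL) have shown strong performance on benchmark datasets, but their real-world effectiveness remains underexplored. Recent work suggests that both graph neural network (GNN)-based models and transformer-based models, including large language models (LLMs), yield promising results when evaluated on curated benchmark datasets. This study systematically evaluates two representative DL models, ReVeal and LineVul, across four representative datasets.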
Reference score (independently computed attention score): 2.8647133890967
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vulnerability detection methods based on deep learning (DL) have shown strong performance on benchmark datasets, yet their real-world effectiveness remains underexplored. Recent work suggests that both graph neural network (GNN)-based and transformer-based models, including large language models (LLMs), yield promising results when evaluated on curated benchmark datasets. These datasets are typically characterized by consistent data distributions and heuristic or partially noisy labels. In this study, we systematically evaluate two representative DL models, ReVeal and LineVul, across four representative datasets: Juliet, Devign, BigVul, and ICVul. Each model is trained independently on each dataset, and its code representations are analyzed using t-SNE to uncover vulnerability-related patterns. To assess realistic applicability, we deploy these models along with four pretrained LLMs (Claude 3.5 Sonnet, GPT-o3-mini, GPT-4o, and GPT-5) on a curated dataset, VentiVul, comprising 20 recently (May 2025) fixed vulnerabilities from the Linux kernel. Our experiments reveal that current models struggle to distinguish vulnerable from non-vulnerable code in representation space and generalize poorly across datasets with differing distributions. When evaluated on VentiVul, our newly constructed time-wise out-of-distribution dataset, performance drops sharply, with most models failing to detect vulnerabilities reliably. These results expose a persistent gap between academic benchmarks and real-world deployment, emphasizing the value of our deployment-oriented evaluation framework and the need for more robust code representations and higher-quality datasets.
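Abstract (reference translation): Vulnerability detection methods based on deep learning (DL) have shown high performance on benchmark datasets, but their effectiveness in practice remains underexplored. Recent work suggests that both graph neural network (GNN)-based models and transformer-based models, including large language models (LLMs), produce promising results when evaluated on curated benchmark datasets. These datasets are typically characterized by consistent data distributions and by heuristic or partially noisy labels. This study systematically evaluates two representative DL models, ReVeal and LineVul, on four representative datasets: Juliet, Devign, BigVul, and ICVul. Each model is trained independently on each dataset, and its code representations are analyzed with t-SNE to reveal vulnerability-related patterns. To assess realistic applicability, these models are deployed together with four pretrained LLMs (Claude 3.5 Sonnet, GPT-o3-mini, GPT-4o, and GPT-5) on VentiVul, a curated dataset containing 20 Linux kernel vulnerabilities that were fixed recently (May 2025). The experiments show that current models struggle to separate vulnerable from non-vulnerable code in representation space and generalize poorly across datasets with differing distributions. When evaluated on VentiVul, the newly constructed time-wise out-of-distribution dataset, performance drops sharply, and most models fail to detect vulnerabilities reliably. These results expose a persistent gap between academic benchmarks and real-world deployment, underscoring the value of the deployment-oriented evaluation framework and the need for more robust code representations and higher-quality datasets.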

Related papers