Fugu-MT paper translation (abstract): Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation

Paper overview: Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation

arxiv url: http://arxiv.org/abs/2509.12629v1
Date: Tue, 16 Sep 2025 03:48:22 GMT
Status: translation complete
Last updated in system: 2025-09-17 17:50:52.866606
Title: Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation
Authors: Zhihong Sun, Jia Li, Yao Wan, Chuanyi Li, Hongyu Zhang, Zhi Jin, Ge Li, Hong Liu, Chen Lyu, Songlin Hu
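Abstract summary: This study explores the potential of ensemble learning to enhance the performance of Large Language Models (LLMs) in source code vulnerability detection. It proposes Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection.
Reference score (attention score computed independently by this site): 69.8237598448941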
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code vulnerability detection is crucial for ensuring the security and reliability of modern software systems. Recently, Large Language Models (LLMs) have shown promising capabilities in this domain. However, notable discrepancies in detection results often arise when analyzing identical code segments across different training stages of the same model or among architecturally distinct LLMs. While such inconsistencies may compromise detection stability, they also highlight a key opportunity: the latent complementarity among models can be harnessed through ensemble learning to create more robust vulnerability detection systems. In this study, we explore the potential of ensemble learning to enhance the performance of LLMs in source code vulnerability detection. We conduct comprehensive experiments involving five LLMs (i.e., DeepSeek-Coder-6.7B, CodeLlama-7B, CodeLlama-13B, CodeQwen1.5-7B, and StarCoder2-15B), using three ensemble strategies (i.e., Bagging, Boosting, and Stacking). These experiments are carried out across three widely adopted datasets (i.e., Devign, ReVeal, and BigVul). Inspired by Mixture of Experts (MoE) techniques, we further propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection. Our results demonstrate that ensemble approaches can significantly improve detection performance, with Boosting excelling in scenarios involving imbalanced datasets. Moreover, DGS consistently outperforms traditional Stacking, particularly in handling class imbalance and multi-class classification tasks. These findings offer valuable insights into building more reliable and effective LLM-based vulnerability detection systems through ensemble learning.
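The abstract names the ensemble strategies but does not describe their internals, so the following is only a minimal sketch of the general idea, not the authors' implementation. It contrasts a hard majority vote, a stacking-style combination with fixed weights, and a gated combination in the spirit of Mixture of Experts, where the mixture weights change per input. All function names and the `gate_scores` input are hypothetical; in a real system the gate scores would presumably come from a learned gating network, which is an assumption here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def majority_vote(probs, threshold=0.5):
    """Hard vote: return 1 (vulnerable) if more than half of the models say so."""
    votes = sum(1 for p in probs if p >= threshold)
    return 1 if votes * 2 > len(probs) else 0


def static_stack(probs, weights):
    """Stacking-style combiner with one fixed weight per model for every input."""
    return sum(w * p for w, p in zip(weights, probs))


def gated_stack(probs, gate_scores):
    """Gated combiner (assumption): per-input scores become per-input weights."""
    gate = softmax(gate_scores)
    return sum(g * p for g, p in zip(gate, probs))


# Hypothetical per-model probabilities that one code snippet is vulnerable.
probs = [0.9, 0.4, 0.45]

vote = majority_vote(probs)                       # only one of three models votes "vulnerable"
fixed = static_stack(probs, [1 / 3, 1 / 3, 1 / 3])  # plain average of the three probabilities
gated = gated_stack(probs, [10.0, 0.0, 0.0])      # gate trusts the first model for this input
```

With a fixed combiner the first model is outvoted on every input, while the gated combiner can follow it on the inputs where the gate scores favor it. Whether DGS works this way is not stated in the abstract.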
