Fugu-MT Paper Translation (Abstract): Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation

Paper summary: Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation

arxiv url: http://arxiv.org/abs/2512.01255v1
Date: Mon, 01 Dec 2025 04:00:06 GMT
Status: Translation complete
Last updated in system: 2025-12-02 19:46:34.678725
Title: Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation
Authors: Qingyuan Fei, Xin Liu, Song Li, Shujiang Wu, Jianwei Hou, Ping Chen, Zifeng Kang
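Abstract summary: We introduce three principles for constructing a benchmark for JavaScript vulnerability detection. We propose FORGEJS, the first automatic benchmark generation framework. We conduct the first systematic evaluation of large language models for JavaScript vulnerability detection.
Reference score (independently computed attention score): 8.85349227459794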
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Researchers have proposed numerous methods to detect vulnerabilities in JavaScript, especially those assisted by Large Language Models (LLMs). However, the actual capability of LLMs in JavaScript vulnerability detection remains questionable, necessitating systematic evaluation and comprehensive benchmarks. Unfortunately, existing benchmarks suffer from three critical limitations: (1) incomplete coverage, such as covering a limited subset of CWE types; (2) underestimation of LLM capabilities caused by unreasonable ground truth labeling; and (3) overestimation due to unrealistic cases such as using isolated vulnerable files rather than complete projects. In this paper, we introduce, for the first time, three principles for constructing a benchmark for JavaScript vulnerability detection that directly address these limitations: (1) comprehensiveness, (2) no underestimation, and (3) no overestimation. Guided by these principles, we propose FORGEJS, the first automatic benchmark generation framework for evaluating LLMs' capability in JavaScript vulnerability detection. Then, we use FORGEJS to construct ARENAJS, the first systematic benchmark for LLM-based JavaScript vulnerability detection, and further propose JUDGEJS, an automatic evaluation framework. We conduct the first systematic evaluation of LLMs for JavaScript vulnerability detection, leveraging JUDGEJS to assess seven popular commercial LLMs on ARENAJS. The results show that LLMs not only exhibit limited reasoning capabilities, but also suffer from severe robustness defects, indicating that reliable JavaScript vulnerability detection with LLMs remains an open challenge.
