Fugu-MT Paper Translation (Abstract): Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Paper overview: Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

arxiv url: http://arxiv.org/abs/2601.22952v1
Date: Fri, 30 Jan 2026 13:14:55 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.46236
Title: Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering
Authors: Yunpeng Xiong, Ting Zhang
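Abstract summary: Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities. SAST tools often produce a high volume of false positives (FPs). Recent advances in Large Language Model (LLM) agents offer a promising direction.
Reference score (site-computed attention score): 2.5335007441696384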
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, namely Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On the real-world dataset, the best configuration of LLM-based agents achieves an FP identification rate of up to 93.3% for CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.
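The abstract reports FP rates before and after agent filtering and notes that aggressive filtering can suppress true vulnerabilities. The following is a minimal sketch of how such before/after rates could be computed on a labeled benchmark. It assumes a rate definition of flagged cases divided by all cases of that ground-truth class; the paper's exact metric definitions are not given in the abstract, and the `Case` type, the `summarize` function, and the toy data are hypothetical and do not reproduce any reported result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One benchmark test case (hypothetical schema, not from the paper)."""
    vulnerable: bool    # ground-truth label
    sast_flagged: bool  # the SAST tool raised an alert for this case
    agent_kept: bool    # the agent judged the alert to be a real vulnerability


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def summarize(cases: list[Case]) -> dict[str, float]:
    """Compute false/true positive rates before and after agent filtering.

    Assumed definitions: FPR = flagged benign cases / all benign cases,
    TPR = flagged vulnerable cases / all vulnerable cases.
    """
    benign = [c for c in cases if not c.vulnerable]
    vulnerable = [c for c in cases if c.vulnerable]
    return {
        "fpr_before": rate(sum(c.sast_flagged for c in benign), len(benign)),
        "fpr_after": rate(
            sum(c.sast_flagged and c.agent_kept for c in benign), len(benign)
        ),
        "tpr_before": rate(
            sum(c.sast_flagged for c in vulnerable), len(vulnerable)
        ),
        "tpr_after": rate(
            sum(c.sast_flagged and c.agent_kept for c in vulnerable),
            len(vulnerable),
        ),
    }


# Made-up toy data: 10 benign cases (9 flagged, agent keeps 1 of them)
# and 5 vulnerable cases (all flagged, agent keeps 4 of them).
toy_cases = (
    [Case(False, True, True)]
    + [Case(False, True, False)] * 8
    + [Case(False, False, False)]
    + [Case(True, True, True)] * 4
    + [Case(True, True, False)]
)

summary = summarize(toy_cases)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

In this toy data the false positive rate falls sharply after filtering, but the true positive rate also drops because one real vulnerability is dismissed, which is the kind of trade-off the abstract highlights.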
