Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering
- URL: http://arxiv.org/abs/2601.22952v1
- Date: Fri, 30 Jan 2026 13:14:55 GMT
- Title: Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering
- Authors: Yunpeng Xiong, Ting Zhang,
- Abstract summary: Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities. SAST tools often produce a high volume of false positives (FPs). Recent advances in Large Language Model (LLM) agents offer a promising direction.
- Score: 2.5335007441696384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, i.e., Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On the real-world dataset, the best configuration of LLM-based agents achieves an FP identification rate of up to 93.3% on CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.
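The trade-off described in the abstract (fewer false positives versus suppressed true vulnerabilities) can be made concrete with a small calculation. The following is a minimal sketch, not the paper's evaluation code: the `Alert` type, both function names, and the toy data are hypothetical, and the metric definitions are assumptions, since the abstract does not spell out how its rates are computed. Here the FP rate is taken to be the share of ground-truth benign alerts that are still reported after filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One SAST alert with a ground-truth label and a filter verdict (hypothetical type)."""
    alert_id: str
    is_real_vulnerability: bool  # ground truth, e.g. from a labeled benchmark
    kept_by_filter: bool         # True if the filter still reports the alert


def false_positive_rate(alerts):
    """Share of benign alerts that remain reported after filtering (assumed definition)."""
    benign = [a for a in alerts if not a.is_real_vulnerability]
    if not benign:
        return 0.0
    return sum(a.kept_by_filter for a in benign) / len(benign)


def suppressed_true_positive_rate(alerts):
    """Share of real vulnerabilities that the filter wrongly discarded."""
    real = [a for a in alerts if a.is_real_vulnerability]
    if not real:
        return 0.0
    return sum(not a.kept_by_filter for a in real) / len(real)


# Toy data, invented for illustration only (not results from the paper).
toy_alerts = [
    Alert("a1", is_real_vulnerability=False, kept_by_filter=False),
    Alert("a2", is_real_vulnerability=False, kept_by_filter=False),
    Alert("a3", is_real_vulnerability=False, kept_by_filter=False),
    Alert("a4", is_real_vulnerability=False, kept_by_filter=True),
    Alert("a5", is_real_vulnerability=True, kept_by_filter=True),
    Alert("a6", is_real_vulnerability=True, kept_by_filter=False),
]

print(f"FP rate after filtering: {false_positive_rate(toy_alerts):.2%}")
print(f"True vulnerabilities suppressed: {suppressed_true_positive_rate(toy_alerts):.2%}")
```

Reporting both numbers side by side makes it harder to read a low FP rate as an unqualified win: a filter that discards every alert would drive the first metric to zero while suppressing every true vulnerability.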
Related papers
- AWE: Adaptive Agents for Dynamic Web Penetration Testing [0.0]
AWE is a memory-augmented multi-agent framework for autonomous web penetration testing. It embeds structured, vulnerability-specific analysis pipelines within a lightweight LLM orchestration layer. AWE achieves substantial gains on injection-class vulnerabilities.
arXiv Detail & Related papers (2026-03-01T07:32:42Z)
- An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems [1.5216960763930782]
The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection. This paper presents a comparative study on the effectiveness of LLM-based techniques for detecting software vulnerabilities.
arXiv Detail & Related papers (2026-01-01T08:05:51Z)
- Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7× compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
- From LLMs to Agents: A Comparative Evaluation of LLMs and LLM-based Agents in Security Patch Detection [42.089851895083804]
Large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities in various software engineering tasks. We conduct a comprehensive evaluation of the performance of LLMs and LLM-based agents for security patch detection. Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent demonstrates the lowest false positive rate (FPR).
arXiv Detail & Related papers (2025-11-11T09:58:41Z)
- Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement [61.35824395228412]
Large language model (LLM) based agents are increasingly used to tackle software engineering tasks. We propose Self-Abstraction from Grounded Experience (SAGE), a framework that enables agents to learn from their own task executions.
arXiv Detail & Related papers (2025-11-08T08:49:38Z)
- SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents [63.70653857721785]
We conduct two in-the-wild experiments to demonstrate the prevalence of low-quality search results and their potential to misguide agent behaviors. To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient.
arXiv Detail & Related papers (2025-09-28T07:05:17Z)
- Towards Effective Complementary Security Analysis using Large Language Models [3.203446435054805]
A key challenge in security analysis is the manual evaluation of potential security weaknesses generated by static application security testing (SAST) tools. We propose using Large Language Models (LLMs) to improve the assessment of SAST findings.
arXiv Detail & Related papers (2025-06-20T10:46:35Z)
- Evaluating the efficacy of LLM Safety Solutions : The Palit Benchmark Dataset [0.46040036610482665]
Large Language Models (LLMs) are increasingly integrated into critical systems in industries like healthcare and finance. This gives rise to a range of attacks in which a user submits a malicious query and the LLM-system outputs a response that creates harm to the owner. While security tools are being developed to counter these threats, there is little formal evaluation of their effectiveness and usability.
arXiv Detail & Related papers (2025-05-19T12:12:00Z)
- AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities. We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o. We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
- Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM. We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)