Related papers: SastBench: A Benchmark for Testing Agentic SAST Triage

SastBench: A Benchmark for Testing Agentic SAST Triage

URL: http://arxiv.org/abs/2601.02941v1
Date: Tue, 06 Jan 2026 11:36:30 GMT
Title: SastBench: A Benchmark for Testing Agentic SAST Triage
Authors: Jake Feiglin, Guy Dar,
Abstract summary: SastBench is a benchmark for evaluating SAST triage agents that combines real CVEs as true positives with filtered SAST tool findings as approximate false positives.<n>We evaluate different agents on the benchmark and present a comparative analysis of their performance.
Score: 3.1175243456844832
License: http://creativecommons.org/licenses/by/4.0/
Abstract: SAST (Static Application Security Testing) tools are among the most widely used techniques in defensive cybersecurity, employed by commercial and non-commercial organizations to identify potential vulnerabilities in software. Despite their great utility, they generate numerous false positives, requiring costly manual filtering (aka triage). While LLM-powered agents show promise for automating cybersecurity tasks, existing benchmarks fail to emulate real-world SAST finding distributions. We introduce SastBench, a benchmark for evaluating SAST triage agents that combines real CVEs as true positives with filtered SAST tool findings as approximate false positives. SastBench features an agent-agnostic design. We evaluate different agents on the benchmark and present a comparative analysis of their performance, provide a detailed analysis of the dataset, and discuss the implications for future development.

Related papers

Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering [2.5335007441696384]
Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities.<n>SAST tools often produce a high volume of false positives (FPs)<n>Recent advances in Large Language Model (LLM) agents offer a promising direction.
arXiv Detail & Related papers (2026-01-30T13:14:55Z)
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [58.411135609139855]
"Shortcuts" to complete tasks pose significant risks for reliable assessment and deployment of large language models.<n>We introduce ImpossibleBench, a benchmark framework that measures LLM agents' propensity to exploit test cases.<n>As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
arXiv Detail & Related papers (2025-10-23T06:58:32Z)
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? [58.48689960350828]
We show that a simple, modular and model-agnostic defense operating at the agent--tool interface achieves perfect security with high utility.<n>We employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer)
arXiv Detail & Related papers (2025-10-06T18:09:02Z)
Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design.<n>Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms.<n>We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z)
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.<n>We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility.<n> Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection [2.5228276786940182]
This paper introduces CASTLE, a benchmarking framework for evaluating the vulnerability detection capabilities of different methods.<n>We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.
arXiv Detail & Related papers (2025-03-12T14:30:05Z)
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs. Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
A Comprehensive Study on Static Application Security Testing (SAST) Tools for Android [22.558610938860124]
VulsTotal is a unified evaluation platform for defining and describing tools' supported vulnerability types. We select 11 free and open-sourced SAST tools from a pool of 97 existing options, adhering to clearly defined criteria. We then unify 67 general/common vulnerability types for Android SAST tools.
arXiv Detail & Related papers (2024-10-28T05:10:22Z)
Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection [11.13802281700894]
Static Application Security Testing (SAST) is usually utilized to scan source code for security vulnerabilities. Deep learning (DL)-based methods have demonstrated their potential in software vulnerability detection. This paper compares 15 diverse SAST tools with 12 popular or state-of-the-art open-source LLMs in detecting software vulnerabilities.
arXiv Detail & Related papers (2024-07-23T07:21:14Z)
An Extensive Comparison of Static Application Security Testing Tools [1.3927943269211593]
Static Application Security Testing Tools (SASTTs) identify software vulnerabilities to support the security and reliability of software applications. Several studies have suggested that alternative solutions may be more effective than SASTTs due to their tendency to generate false alarms. Our SASTTs evaluation is based on a controlled, though synthetic, Java.
arXiv Detail & Related papers (2024-03-14T09:37:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.