ZeroFalse: Improving Precision in Static Analysis with LLMs
- URL: http://arxiv.org/abs/2510.02534v1
- Date: Thu, 02 Oct 2025 20:07:25 GMT
- Title: ZeroFalse: Improving Precision in Static Analysis with LLMs
- Authors: Mohsen Iranmanesh, Sina Moradi Sabet, Sina Marefat, Ali Javidi Ghasr, Allison Wilson, Iman Sharafaldin, Mohammad A. Tayebi,
- Abstract summary: Static Application Security Testing (SAST) tools are integral to modern software development, yet their adoption is undermined by excessive false positives.<n>We present ZeroFalse, a framework that integrates static analysis with large language models (LLMs) to reduce false positives while preserving coverage.
- Score: 0.1759008116536278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Static Application Security Testing (SAST) tools are integral to modern software development, yet their adoption is undermined by excessive false positives that weaken developer trust and demand costly manual triage. We present ZeroFalse, a framework that integrates static analysis with large language models (LLMs) to reduce false positives while preserving coverage. ZeroFalse treats static analyzer outputs as structured contracts, enriching them with flow-sensitive traces, contextual evidence, and CWE-specific knowledge before adjudication by an LLM. This design preserves the systematic reach of static analysis while leveraging the reasoning capabilities of LLMs. We evaluate ZeroFalse across both benchmarks and real-world projects using ten state-of-the-art LLMs. Our best-performing models achieve F1-scores of 0.912 on the OWASP Java Benchmark and 0.955 on the OpenVuln dataset, maintaining recall and precision above 90%. Results further show that CWE-specialized prompting consistently outperforms generic prompts, and reasoning-oriented LLMs provide the most reliable precision-recall balance. These findings position ZeroFalse as a practical and scalable approach for enhancing the reliability of SAST and supporting its integration into real-world CI/CD pipelines.
Related papers
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems [53.52419750390942]
Large language models (LLMs) are used in mission-critical factual domains.<n>LLMs exhibit poor calibration performance due to noisy retrieved contexts.<n>We propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise.
arXiv Detail & Related papers (2026-01-16T05:38:25Z) - Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems.<n>We present FactArena, a fully automated arena-style evaluation framework.<n>Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - LLM-Driven Adaptive Source-Sink Identification and False Positive Mitigation for Static Analysis [0.0]
textscAdaTaint adaptively infers source/sink specifications and filters spurious alerts through neuro-symbolic reasoning.<n>textscAdaTaint grounds model suggestions in program facts and constraint validation, ensuring both adaptability and determinism.<n>Results show that textscAdaTaint reduces false positives by textbf43.7% on average and improves recall by textbf11.2% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-06T03:44:10Z) - Importance Sampling is All You Need: Predict LLM's performance on new benchmark by reusing existing benchmark [38.42021928363628]
Existing benchmarks face two major challenges: (1) the escalating cost of constructing high-quality test suites and reference solutions, and (2) the increasing risk of data contamination.<n>We propose BIS, a prompt-centric evaluation framework that enables ground-truth-free prediction of LLM performance on code generation tasks.<n>Our framework achieves an average absolute prediction error of 1.1% for code correctness scores, with best- and worst-case errors of 0.3% and 1.9%, respectively.
arXiv Detail & Related papers (2025-08-02T05:34:05Z) - Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection.<n>Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities?<n>This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z) - The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs [17.497629884237647]
BugLens is a post-refinement framework that significantly enhances static analysis precision for bug detection.<n>LLMs demonstrate promising code understanding capabilities, but their direct application to program analysis remains unreliable.<n>LLMs guide LLMs through structured reasoning steps to assess security impact and validate constraints from the source code.
arXiv Detail & Related papers (2025-04-16T02:17:06Z) - CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection [2.5228276786940182]
This paper introduces CASTLE, a benchmarking framework for evaluating the vulnerability detection capabilities of different methods.<n>We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.
arXiv Detail & Related papers (2025-03-12T14:30:05Z) - Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness.<n>We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.