Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection
- URL: http://arxiv.org/abs/2512.22306v1
- Date: Fri, 26 Dec 2025 05:43:35 GMT
- Title: Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection
- Authors: Chinmay Pushkar, Sanchit Kabra, Dhruv Kumar, Jagat Sesh Challa
- Abstract summary: We introduce a benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by injecting controlled counts of vulnerabilities into long-context code samples. Our results reveal a sharp degradation in performance as vulnerability density increases.
- Score: 1.2802720336459552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or function-level classification, failing to reflect the complexity of real-world software where multiple interacting vulnerabilities often coexist within large files. Recent studies indicate that LLMs suffer from "count bias" and "selection bias" in multi-label tasks, yet this has not been rigorously quantified in the domain of code security. In this work, we introduce a comprehensive benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by systematically injecting controlled counts of vulnerabilities (1, 3, 5, and 9) into long-context code samples (7.5k-10k tokens) sourced from CodeParrot. We evaluate five state-of-the-art LLMs, including GPT-4o-mini, Llama-3.3-70B, and the Qwen-2.5 series. Our results reveal a sharp degradation in performance as vulnerability density increases. While Llama-3.3-70B achieves near-perfect F1 scores (approximately 0.97) on single-vulnerability C tasks, performance drops by up to 40% in high-density settings. Notably, Python and JavaScript show distinct failure modes compared to C/C++, with models exhibiting severe "under-counting" (recall dropping below 0.30) in complex Python files.
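The benchmark construction described above lends itself to a compact illustration: known vulnerable patterns are planted at controlled counts in otherwise clean files, and a model's predicted count is compared against the ground truth. The Python sketch below is a minimal mock-up of that setup, assuming hypothetical single-line vulnerability snippets and a simple signed counting error; it is not the authors' actual injection pipeline.

```python
import random

# Hypothetical vulnerability snippets (stand-ins for the paper's injected
# patterns; the real benchmark injects realistic, language-specific bugs).
PY_VULN_SNIPPETS = [
    'os.system("cat " + user_input)        # command injection',
    'pickle.loads(untrusted_bytes)         # unsafe deserialization',
    'eval(request.args["expr"])            # arbitrary code execution',
    'yaml.load(doc)                        # unsafe YAML load',
]

def inject_vulnerabilities(clean_lines, count, rng=random):
    """Insert `count` vulnerable lines (1, 3, 5, or 9 in the paper)
    at random positions into an otherwise clean source file."""
    lines = list(clean_lines)
    for snippet in rng.choices(PY_VULN_SNIPPETS, k=count):
        lines.insert(rng.randrange(len(lines) + 1), snippet)
    return lines

def count_bias(true_count, predicted_count):
    """Signed counting error: negative values correspond to the
    "under-counting" failure mode the abstract describes."""
    return predicted_count - true_count
```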
Related papers
- Diverse LLMs vs. Vulnerabilities: Who Detects and Fixes Them Better? [1.0026496861838445]
DVDR-LLM is an ensemble framework that combines outputs from diverse large language models. Our evaluation reveals that DVDR-LLM achieves 10-12% higher detection accuracy compared to the average performance of individual models.
arXiv Detail & Related papers (2025-12-14T03:47:39Z)
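The summary does not say how DVDR-LLM merges its constituent models' outputs. As a point of reference only, a common way to combine diverse detectors is per-finding majority voting, sketched below; the finding representation and vote threshold are assumptions, not details from the paper.

```python
from collections import Counter

def ensemble_detect(per_model_findings, min_votes=2):
    """Combine vulnerability findings from several LLMs by majority vote.

    per_model_findings: list of sets, one per model, each containing
    (line_number, cwe_id) tuples the model flagged.  A finding is kept
    when at least `min_votes` models agree on it.
    """
    votes = Counter()
    for findings in per_model_findings:
        votes.update(findings)
    return {finding for finding, n in votes.items() if n >= min_votes}

# Example: three models, two agree on a flaw at line 42.
model_a = {(42, "CWE-89"), (10, "CWE-79")}
model_b = {(42, "CWE-89")}
model_c = {(7, "CWE-22")}
print(ensemble_detect([model_a, model_b, model_c]))  # {(42, 'CWE-89')}
```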
- Has the Two-Decade-Old Prophecy Come True? Artificial Bad Intelligence Triggered by Merely a Single-Bit Flip in Large Language Models [16.552905034341343]
Bit-Flip Attack (BFA) has garnered widespread attention for its ability to compromise software system integrity remotely through hardware fault injection. This paper is the first to systematically discover and validate the existence of single-bit vulnerabilities in large language models (LLMs) using .gguf quantized formats. At an attack frequency of 464.3 times per second, a single bit can be flipped with 100% success in as little as 31.7 seconds.
arXiv Detail & Related papers (2025-10-01T04:20:03Z)
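To make the scale of a single-bit fault concrete: one flipped bit in a stored weight can change its value by orders of magnitude. The sketch below flips a chosen bit in a float32 encoding; it does not model the .gguf layout or the attack itself.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 encoding of `value`.

    A flip in the exponent bits (23-30) can change a weight by orders
    of magnitude, which is why one hardware fault can be so damaging.
    """
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

print(flip_bit(0.5, 30))  # 0.5 -> ~1.7e38: one exponent bit, huge change
```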
- A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language, 94.3% target only function-level or statement-level tasks, and over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z)
- Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation [69.8237598448941]
This study investigates the potential of ensemble learning to enhance the performance of Large Language Models (LLMs) in source code vulnerability detection. We propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection.
arXiv Detail & Related papers (2025-09-16T03:48:22Z)
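The entry names Dynamic Gated Stacking without construction details. For orientation, here is a generic gated-stacking sketch in which a gate network assigns per-sample mixing weights to base detectors; all shapes, names, and the linear gate are assumptions rather than the paper's design.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gated_stack_predict(base_probs, gate_weights, gate_bias, threshold=0.5):
    """Generic gated stacking for binary vulnerability detection.

    base_probs:   (n_samples, n_models) vulnerability probabilities
                  from the base LLM detectors.
    gate_weights: (n_models, n_models) parameters of a linear gate that
                  maps base outputs to per-sample mixing weights.
    The gate lets different models dominate on different inputs, which
    is the intuition behind "dynamic" stacking.
    """
    mix = softmax(base_probs @ gate_weights + gate_bias)   # per-sample weights
    fused = (mix * base_probs).sum(axis=1)                 # weighted vote
    return (fused >= threshold).astype(int)

# Toy usage with three base detectors and an untrained (random) gate.
rng = np.random.default_rng(0)
probs = rng.random((4, 3))
preds = gated_stack_predict(probs, rng.normal(size=(3, 3)), np.zeros(3))
print(preds)
```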
- Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection [0.0]
Three industry-standard rule-based static code-analysis tools (Sonar, CodeQL and Snyk Code) and three state-of-the-art large language models hosted on the GitHub Models platform (GPT-4.1, Mistral Large and DeepSeek V3) were evaluated. Using a curated suite of ten real-world C# projects that embed 63 vulnerabilities, we measure classical accuracy (precision, recall, F-score), analysis latency, granularity and the developer effort required to vet true positives. We recommend a hybrid pipeline: employ language models early in development for broad, context-aware detection.
arXiv Detail & Related papers (2025-08-06T13:48:38Z)
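The recommended hybrid pipeline is easy to outline: an LLM pass for broad recall, then a rule-based pass to confirm findings before they reach a developer. A minimal sketch, with `llm_scan` and `static_scan` as hypothetical stand-ins for the evaluated tools:

```python
def hybrid_scan(source: str, llm_scan, static_scan) -> dict:
    """Two-stage hybrid vulnerability triage.

    Stage 1: a broad, context-aware LLM pass flags candidate findings.
    Stage 2: a precise rule-based static analyzer confirms candidates,
    cutting down the false positives a developer must vet.
    Both scanners are injected as callables returning sets of
    (line_number, rule_id) findings.
    """
    candidates = llm_scan(source)
    confirmed = candidates & static_scan(source)
    return {
        "confirmed": confirmed,               # high confidence: both agree
        "llm_only": candidates - confirmed,   # review manually
    }

# Toy usage with stubbed scanners.
findings = hybrid_scan(
    "int main() { gets(buf); }",
    llm_scan=lambda src: {(1, "CWE-242"), (1, "CWE-787")},
    static_scan=lambda src: {(1, "CWE-242")},
)
print(findings["confirmed"])  # {(1, 'CWE-242')}
```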
- SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection [8.440793630384546]
Large Language Models (LLMs) have shown promise in software engineering tasks. Evaluating their effectiveness in vulnerability detection is challenging due to the lack of high-quality datasets. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024.
arXiv Detail & Related papers (2025-05-26T11:06:03Z)
- HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems [2.4241401076864]
The HackerRank-ASTRA Benchmark introduces project-based coding problems that mirror real-world scenarios. It evaluates model consistency through 32 runs (k = 32) and median standard deviation. The top three models (o1, o1-preview, and Claude-3.5-Sonnet-1022) achieved comparable average scores of 75%.
arXiv Detail & Related papers (2025-01-31T23:47:02Z)
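The consistency protocol, 32 runs per problem summarized by the median standard deviation, is simple to state in code. A small sketch, assuming scores are stored as a problems-by-runs array:

```python
import numpy as np

def consistency_score(scores: np.ndarray) -> float:
    """Median standard deviation across repeated runs.

    scores: (n_problems, k) array of per-run scores, with k = 32 in the
    HackerRank-ASTRA setup.  For each problem, compute the standard
    deviation over its k runs, then take the median across problems;
    lower values mean the model answers more consistently.
    """
    per_problem_std = scores.std(axis=1, ddof=1)
    return float(np.median(per_problem_std))

rng = np.random.default_rng(42)
scores = rng.uniform(0.6, 0.9, size=(10, 32))  # 10 problems, 32 runs each
print(round(consistency_score(scores), 4))
```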
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs). It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
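Zero-shot here means the model receives only the buggy program and an instruction, with no worked examples in the prompt. A minimal sketch of such an evaluation step follows; `query_model` and the per-problem test runner are hypothetical stand-ins, not DebugBench internals:

```python
def build_zero_shot_prompt(buggy_code: str, language: str) -> str:
    """Zero-shot debugging prompt: task description plus the buggy
    snippet, with no in-context examples."""
    return (
        f"The following {language} code contains a bug.\n"
        "Return only the corrected code.\n\n"
        + buggy_code
    )

def evaluate_fix(candidate_code: str, run_tests) -> bool:
    """A repair counts as correct when the candidate passes the
    problem's test suite (run_tests is supplied per problem)."""
    return run_tests(candidate_code)

# Usage outline (query_model is a hypothetical LLM client):
# fixed = query_model(build_zero_shot_prompt(buggy, "cpp"))
# print(evaluate_fix(fixed, run_tests))
```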
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
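VELVET's summary describes fusing graph-based and sequence-based networks to rank statements. One simple fusion, a weighted average of per-statement scores from the two branches followed by a top-1 pick, is sketched below; it is an illustration, not the paper's architecture.

```python
import numpy as np

def fuse_statement_scores(graph_scores, seq_scores, weight=0.5):
    """Ensemble two per-statement vulnerability scores by weighted
    averaging: the graph branch captures structural (global) context,
    the sequence branch captures local token context."""
    graph_scores = np.asarray(graph_scores, dtype=float)
    seq_scores = np.asarray(seq_scores, dtype=float)
    return weight * graph_scores + (1.0 - weight) * seq_scores

def top1_statement(fused_scores):
    """Index of the statement ranked most likely vulnerable; top-1
    accuracy asks how often this matches the ground-truth statement."""
    return int(np.argmax(fused_scores))

graph = [0.10, 0.80, 0.30]   # e.g. GNN branch outputs per statement
seq   = [0.20, 0.60, 0.90]   # e.g. transformer branch outputs
print(top1_statement(fuse_statement_scores(graph, seq)))  # -> 1
```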