Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection
- URL: http://arxiv.org/abs/2512.22306v1
- Date: Fri, 26 Dec 2025 05:43:35 GMT
- Title: Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection
- Authors: Chinmay Pushkar, Sanchit Kabra, Dhruv Kumar, Jagat Sesh Challa
- Abstract summary: We introduce a benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by injecting controlled counts of vulnerabilities into long-context code samples. Our results reveal a sharp degradation in performance as vulnerability density increases.
- Score: 1.2802720336459552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or function-level classification, failing to reflect the complexity of real-world software where multiple interacting vulnerabilities often coexist within large files. Recent studies indicate that LLMs suffer from "count bias" and "selection bias" in multi-label tasks, yet this has not been rigorously quantified in the domain of code security. In this work, we introduce a comprehensive benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by systematically injecting controlled counts of vulnerabilities (1, 3, 5, and 9) into long-context code samples (7.5k-10k tokens) sourced from CodeParrot. We evaluate five state-of-the-art LLMs, including GPT-4o-mini, Llama-3.3-70B, and the Qwen-2.5 series. Our results reveal a sharp degradation in performance as vulnerability density increases. While Llama-3.3-70B achieves near-perfect F1 scores (approximately 0.97) on single-vulnerability C tasks, performance drops by up to 40% in high-density settings. Notably, Python and JavaScript show distinct failure modes compared to C/C++, with models exhibiting severe "under-counting" (recall dropping below 0.30) in complex Python files.
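The benchmark construction described above lends itself to a compact illustration: known vulnerable patterns are planted at controlled counts in otherwise clean files, and a model's predicted count is compared against the ground truth. The Python sketch below is a minimal mock-up of that setup, assuming hypothetical single-line vulnerability snippets and a simple signed counting error; it is not the authors' actual injection pipeline.

```python
import random

# Hypothetical vulnerability snippets (stand-ins for the paper's injected
# patterns; the real benchmark injects realistic, language-specific bugs).
PY_VULN_SNIPPETS = [
    'os.system("cat " + user_input)        # command injection',
    'pickle.loads(untrusted_bytes)         # unsafe deserialization',
    'eval(request.args["expr"])            # arbitrary code execution',
    'yaml.load(doc)                        # unsafe YAML load',
]

def inject_vulnerabilities(clean_lines, count, rng=random):
    """Insert `count` vulnerable lines (1, 3, 5, or 9 in the paper)
    at random positions into an otherwise clean source file."""
    lines = list(clean_lines)
    for snippet in rng.choices(PY_VULN_SNIPPETS, k=count):
        lines.insert(rng.randrange(len(lines) + 1), snippet)
    return lines

def count_bias(true_count, predicted_count):
    """Signed counting error: negative values correspond to the
    "under-counting" failure mode the abstract describes."""
    return predicted_count - true_count
```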
Related papers
- Diverse LLMs vs. Vulnerabilities: Who Detects and Fixes Them Better? [1.0026496861838445]
DVDR-LLM is an ensemble framework that combines outputs from diverse large language models. Our evaluation reveals that DVDR-LLM achieves 10-12% higher detection accuracy compared to the average performance of individual models.
arXiv Detail & Related papers (2025-12-14T03:47:39Z)
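The summary does not say how DVDR-LLM merges its constituent models' outputs. As a point of reference only, a common way to combine diverse detectors is per-finding majority voting, sketched below; the finding representation and vote threshold are assumptions, not details from the paper.

```python
from collections import Counter

def ensemble_detect(per_model_findings, min_votes=2):
    """Combine vulnerability findings from several LLMs by majority vote.

    per_model_findings: list of sets, one per model, each containing
    (line_number, cwe_id) tuples the model flagged.  A finding is kept
    when at least `min_votes` models agree on it.
    """
    votes = Counter()
    for findings in per_model_findings:
        votes.update(findings)
    return {finding for finding, n in votes.items() if n >= min_votes}

# Example: three models, two agree on a flaw at line 42.
model_a = {(42, "CWE-89"), (10, "CWE-79")}
model_b = {(42, "CWE-89")}
model_c = {(7, "CWE-22")}
print(ensemble_detect([model_a, model_b, model_c]))  # {(42, 'CWE-89')}
```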
- Has the Two-Decade-Old Prophecy Come True? Artificial Bad Intelligence Triggered by Merely a Single-Bit Flip in Large Language Models [16.552905034341343]
Bit-Flip Attack (BFA) has garnered widespread attention for its ability to compromise software system integrity remotely through hardware fault injection. This paper is the first to systematically discover and validate the existence of single-bit vulnerabilities in large language models (LLMs) using .gguf quantized formats. At an attack frequency of 464.3 times per second, a single bit can be flipped with 100% success in as little as 31.7 seconds.
arXiv Detail & Related papers (2025-10-01T04:20:03Z)
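To make the scale of a single-bit fault concrete: one flipped bit in a stored weight can change its value by orders of magnitude. The sketch below flips a chosen bit in a float32 encoding; it does not model the .gguf layout or the attack itself.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 encoding of `value`.

    A flip in the exponent bits (23-30) can change a weight by orders
    of magnitude, which is why one hardware fault can be so damaging.
    """
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

print(flip_bit(0.5, 30))  # 0.5 -> ~1.7e38: one exponent bit, huge change
```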
- A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language, 94.3% target only function-level or statement-level tasks, and over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z)
- Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation [69.8237598448941]
This study investigates the potential of ensemble learning to enhance the performance of Large Language Models (LLMs) in source code vulnerability detection. We propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection.
arXiv Detail & Related papers (2025-09-16T03:48:22Z)
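The entry names Dynamic Gated Stacking without construction details. For orientation, here is a generic gated-stacking sketch in which a gate network assigns per-sample mixing weights to base detectors; all shapes, names, and the linear gate are assumptions rather than the paper's design.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gated_stack_predict(base_probs, gate_weights, gate_bias, threshold=0.5):
    """Generic gated stacking for binary vulnerability detection.

    base_probs:   (n_samples, n_models) vulnerability probabilities
                  from the base LLM detectors.
    gate_weights: (n_models, n_models) parameters of a linear gate that
                  maps base outputs to per-sample mixing weights.
    The gate lets different models dominate on different inputs, which
    is the intuition behind "dynamic" stacking.
    """
    mix = softmax(base_probs @ gate_weights + gate_bias)   # per-sample weights
    fused = (mix * base_probs).sum(axis=1)                 # weighted vote
    return (fused >= threshold).astype(int)

# Toy usage with three base detectors and an untrained (random) gate.
rng = np.random.default_rng(0)
probs = rng.random((4, 3))
preds = gated_stack_predict(probs, rng.normal(size=(3, 3)), np.zeros(3))
print(preds)
```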
- Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection [0.0]
Three industry-standard rule-based static code-analysis tools (Sonar, CodeQL and Snyk Code) and three state-of-the-art large language models hosted on the GitHub Models platform (GPT-4.1, Mistral Large and DeepSeek V3) were evaluated. Using a curated suite of ten real-world C# projects that embed 63 vulnerabilities, we measure classical accuracy (precision, recall, F-score), analysis latency, granularity and the developer effort required to vet true positives. We recommend a hybrid pipeline: employ language models early in development for broad, context-aware detection.
arXiv Detail & Related papers (2025-08-06T13:48:38Z)
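The recommended hybrid pipeline is easy to outline: an LLM pass for broad recall, then a rule-based pass to confirm findings before they reach a developer. A minimal sketch, with `llm_scan` and `static_scan` as hypothetical stand-ins for the evaluated tools:

```python
def hybrid_scan(source: str, llm_scan, static_scan) -> dict:
    """Two-stage hybrid vulnerability triage.

    Stage 1: a broad, context-aware LLM pass flags candidate findings.
    Stage 2: a precise rule-based static analyzer confirms candidates,
    cutting down the false positives a developer must vet.
    Both scanners are injected as callables returning sets of
    (line_number, rule_id) findings.
    """
    candidates = llm_scan(source)
    confirmed = candidates & static_scan(source)
    return {
        "confirmed": confirmed,               # high confidence: both agree
        "llm_only": candidates - confirmed,   # review manually
    }

# Toy usage with stubbed scanners.
findings = hybrid_scan(
    "int main() { gets(buf); }",
    llm_scan=lambda src: {(1, "CWE-242"), (1, "CWE-787")},
    static_scan=lambda src: {(1, "CWE-242")},
)
print(findings["confirmed"])  # {(1, 'CWE-242')}
```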
- SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection [8.440793630384546]
Large Language Models (LLMs) have shown promise in software engineering tasks. Evaluating their effectiveness in vulnerability detection is challenging due to the lack of high-quality datasets. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024.
arXiv Detail & Related papers (2025-05-26T11:06:03Z)
- HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems [2.4241401076864]
The HackerRank-ASTRA Benchmark introduces project-based coding problems that mirror real-world scenarios. It evaluates model consistency through 32 runs (k = 32) and median standard deviation. The top three models (o1, o1-preview, and Claude-3.5-Sonnet-1022) achieved comparable average scores of 75%.
arXiv Detail & Related papers (2025-01-31T23:47:02Z)
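The consistency protocol, 32 runs per problem summarized by the median standard deviation, is simple to state in code. A small sketch, assuming scores are stored as a problems-by-runs array:

```python
import numpy as np

def consistency_score(scores: np.ndarray) -> float:
    """Median standard deviation across repeated runs.

    scores: (n_problems, k) array of per-run scores, with k = 32 in the
    HackerRank-ASTRA setup.  For each problem, compute the standard
    deviation over its k runs, then take the median across problems;
    lower values mean the model answers more consistently.
    """
    per_problem_std = scores.std(axis=1, ddof=1)
    return float(np.median(per_problem_std))

rng = np.random.default_rng(42)
scores = rng.uniform(0.6, 0.9, size=(10, 32))  # 10 problems, 32 runs each
print(round(consistency_score(scores), 4))
```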
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs). It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
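Zero-shot here means the model receives only the buggy program and an instruction, with no worked examples in the prompt. A minimal sketch of such an evaluation step follows; `query_model` and the per-problem test runner are hypothetical stand-ins, not DebugBench internals:

```python
def build_zero_shot_prompt(buggy_code: str, language: str) -> str:
    """Zero-shot debugging prompt: task description plus the buggy
    snippet, with no in-context examples."""
    return (
        f"The following {language} code contains a bug.\n"
        "Return only the corrected code.\n\n"
        + buggy_code
    )

def evaluate_fix(candidate_code: str, run_tests) -> bool:
    """A repair counts as correct when the candidate passes the
    problem's test suite (run_tests is supplied per problem)."""
    return run_tests(candidate_code)

# Usage outline (query_model is a hypothetical LLM client):
# fixed = query_model(build_zero_shot_prompt(buggy, "cpp"))
# print(evaluate_fix(fixed, run_tests))
```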
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
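VELVET's summary describes fusing graph-based and sequence-based networks to rank statements. One simple fusion, a weighted average of per-statement scores from the two branches followed by a top-1 pick, is sketched below; it is an illustration, not the paper's architecture.

```python
import numpy as np

def fuse_statement_scores(graph_scores, seq_scores, weight=0.5):
    """Ensemble two per-statement vulnerability scores by weighted
    averaging: the graph branch captures structural (global) context,
    the sequence branch captures local token context."""
    graph_scores = np.asarray(graph_scores, dtype=float)
    seq_scores = np.asarray(seq_scores, dtype=float)
    return weight * graph_scores + (1.0 - weight) * seq_scores

def top1_statement(fused_scores):
    """Index of the statement ranked most likely vulnerable; top-1
    accuracy asks how often this matches the ground-truth statement."""
    return int(np.argmax(fused_scores))

graph = [0.10, 0.80, 0.30]   # e.g. GNN branch outputs per statement
seq   = [0.20, 0.60, 0.90]   # e.g. transformer branch outputs
print(top1_statement(fuse_statement_scores(graph, seq)))  # -> 1
```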