MultiVer: Zero-Shot Multi-Agent Vulnerability Detection
- URL: http://arxiv.org/abs/2602.17875v1
- Date: Thu, 19 Feb 2026 22:20:17 GMT
- Title: MultiVer: Zero-Shot Multi-Agent Vulnerability Detection
- Authors: Shreshth Rajan,
- Abstract summary: MultiVer is a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MultiVer, a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points -- the first zero-shot system to surpass fine-tuned performance on this benchmark. On SecurityEval, the same architecture achieves a 91.7% detection rate, matching specialized systems. The recall improvement comes at a precision cost: 48.8% precision versus 63.9% for fine-tuned baselines, yielding 61.4% F1. Ablation experiments isolate component contributions: the multi-agent ensemble adds 17 percentage points of recall over single-agent security analysis. These results demonstrate that for security applications where false negatives are costlier than false positives, zero-shot multi-agent ensembles can match and exceed fine-tuned models on the metric that matters most.
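The union-voting rule described in the abstract is simple to state in code. A minimal sketch, assuming each agent's output reduces to a boolean vulnerable/clean verdict; the agent roles come from the abstract, but the function and data structures are illustrative, not the authors' implementation:

```python
from typing import Dict

# Agent roles named in the abstract; the voting logic below is an
# illustrative reconstruction, not the authors' code.
AGENT_ROLES = ("security", "correctness", "performance", "style")

def union_vote(verdicts: Dict[str, bool]) -> bool:
    """Flag code as vulnerable if ANY agent flags it.

    Union voting trades precision for recall: a single "vulnerable"
    verdict suffices, so false negatives are rare but false positives
    accumulate across agents, matching the trade-off reported above.
    """
    return any(verdicts.values())

# Only the security agent flags the snippet; union voting still flags it.
verdicts = {"security": True, "correctness": False,
            "performance": False, "style": False}
print(union_vote(verdicts))  # True
```

An intersection (unanimous) rule would invert the trade-off, raising precision at the cost of the recall the paper optimizes for.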
Related papers
- Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration [0.0]
We answer with a reliability level, a single number per system-task pair. Self-consistency sampling reduces uncertainty exponentially, and conformal calibration guarantees correctness within 1/(n+1) of the target level.
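The 1/(n+1) bound quoted above is the standard split-conformal coverage guarantee. A minimal sketch of how such a calibration threshold is computed, using generic split conformal prediction rather than the paper's specific procedure:

```python
import math
from typing import List

def conformal_threshold(cal_scores: List[float], target: float = 0.9) -> float:
    """Return the split-conformal score threshold for a coverage target.

    With n calibration scores, thresholding at the ceil((n+1)*target)-th
    smallest score guarantees coverage in [target, target + 1/(n+1)],
    i.e. correctness within 1/(n+1) of the target level.
    """
    n = len(cal_scores)
    rank = min(math.ceil((n + 1) * target), n)  # 1-indexed order statistic
    return sorted(cal_scores)[rank - 1]

# With n = 10 calibration scores and an 80% target, the rank is
# ceil(11 * 0.8) = 9, so the 9th smallest score is the threshold.
print(conformal_threshold(list(range(1, 11)), target=0.8))  # 9
```

The guarantee tightens as n grows: 100 calibration samples bound the slack at under 1%, which is why the abstract can state the deviation exactly as 1/(n+1).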
arXiv Detail & Related papers (2026-02-24T21:03:50Z) - How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study [5.740397289924559]
No universal winner exists, with detector rankings exhibiting substantial instability. Our findings challenge the "one-size-fits-all" detector paradigm.
arXiv Detail & Related papers (2026-02-08T04:36:13Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - Multi-Agent LLM Committees for Autonomous Software Beta Testing [0.0]
The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success. The framework enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
arXiv Detail & Related papers (2025-12-21T02:06:53Z) - Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks [0.0]
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. We conduct the first systematic testing and comparative evaluation of agentic AI systems. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy.
arXiv Detail & Related papers (2025-12-16T19:22:50Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model, and task properties. We derive a predictive model from coordination metrics, cross-validated across task domains, enabling prediction on unseen task domains. We identify three effects, including: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Multi-Agent Code Verification with Compound Vulnerability Detection [0.0]
Existing tools only catch 65% of bugs, with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs.
arXiv Detail & Related papers (2025-11-20T03:40:27Z) - VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z) - Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents [10.378745306569053]
VulTrial is a courtroom-inspired framework designed to enhance automated vulnerability detection. It employs four role-specific agents: security researcher, code author, moderator, and review board. Using GPT-3.5 and GPT-4o, VulTrial improves performance by 102.39% and 84.17% over its respective baselines.
arXiv Detail & Related papers (2025-05-16T07:54:10Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning [79.43940012723539]
ADCLR is a self-supervised learning framework for learning accurate and dense vision representation.
Our approach achieves new state-of-the-art performance for contrastive methods.
arXiv Detail & Related papers (2023-06-23T07:38:09Z) - G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering [116.4277292854053]
Federated Learning (FL) offers collaborative model training without data sharing.
FL is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity.
We present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem.
arXiv Detail & Related papers (2023-06-08T07:15:04Z)