MultiVer: Zero-Shot Multi-Agent Vulnerability Detection
- URL: http://arxiv.org/abs/2602.17875v1
- Date: Thu, 19 Feb 2026 22:20:17 GMT
- Title: MultiVer: Zero-Shot Multi-Agent Vulnerability Detection
- Authors: Shreshth Rajan,
- Abstract summary: MultiVer is a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MultiVer, a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points -- the first zero-shot system to surpass fine-tuned performance on this benchmark. On SecurityEval, the same architecture achieves a 91.7% detection rate, matching specialized systems. The recall improvement comes at a precision cost: 48.8% precision versus 63.9% for fine-tuned baselines, yielding 61.4% F1. Ablation experiments isolate component contributions: the multi-agent ensemble adds 17 percentage points of recall over single-agent security analysis. These results demonstrate that for security applications where false negatives are costlier than false positives, zero-shot multi-agent ensembles can match and exceed fine-tuned models on the metric that matters most.
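The union-voting rule described in the abstract is simple to state in code. A minimal sketch, assuming each agent's output reduces to a boolean vulnerable/clean verdict; the agent roles come from the abstract, but the function and data structures are illustrative, not the authors' implementation:

```python
from typing import Dict

# Agent roles named in the abstract; the voting logic below is an
# illustrative reconstruction, not the authors' code.
AGENT_ROLES = ("security", "correctness", "performance", "style")

def union_vote(verdicts: Dict[str, bool]) -> bool:
    """Flag code as vulnerable if ANY agent flags it.

    Union voting trades precision for recall: a single "vulnerable"
    verdict suffices, so false negatives are rare but false positives
    accumulate across agents, matching the trade-off reported above.
    """
    return any(verdicts.values())

# Only the security agent flags the snippet; union voting still flags it.
verdicts = {"security": True, "correctness": False,
            "performance": False, "style": False}
print(union_vote(verdicts))  # True
```

An intersection (unanimous) rule would invert the trade-off, raising precision at the cost of the recall the paper optimizes for.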
Related papers
- Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration [0.0]
We answer with a reliability level, a single number per system-task pair. Self-consistency sampling reduces uncertainty exponentially, and conformal calibration guarantees correctness within 1/(n+1) of the target level.
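The 1/(n+1) bound quoted above is the standard split-conformal coverage guarantee. A minimal sketch of how such a calibration threshold is computed, using generic split conformal prediction rather than the paper's specific procedure:

```python
import math
from typing import List

def conformal_threshold(cal_scores: List[float], target: float = 0.9) -> float:
    """Return the split-conformal score threshold for a coverage target.

    With n calibration scores, thresholding at the ceil((n+1)*target)-th
    smallest score guarantees coverage in [target, target + 1/(n+1)],
    i.e. correctness within 1/(n+1) of the target level.
    """
    n = len(cal_scores)
    rank = min(math.ceil((n + 1) * target), n)  # 1-indexed order statistic
    return sorted(cal_scores)[rank - 1]

# With n = 10 calibration scores and an 80% target, the rank is
# ceil(11 * 0.8) = 9, so the 9th smallest score is the threshold.
print(conformal_threshold(list(range(1, 11)), target=0.8))  # 9
```

The guarantee tightens as n grows: 100 calibration samples bound the slack at under 1%, which is why the abstract can state the deviation exactly as 1/(n+1).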
arXiv Detail & Related papers (2026-02-24T21:03:50Z) - How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study [5.740397289924559]
No universal winner exists, with detector rankings exhibiting substantial instability. Our findings challenge the "one-size-fits-all" detector paradigm.
arXiv Detail & Related papers (2026-02-08T04:36:13Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - Multi-Agent LLM Committees for Autonomous Software Beta Testing [0.0]
The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success. The framework enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
arXiv Detail & Related papers (2025-12-21T02:06:53Z) - Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks [0.0]
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. We conduct the first systematic testing and comparative evaluation of agentic AI systems. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy.
arXiv Detail & Related papers (2025-12-16T19:22:50Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model, and task properties. We derive a predictive model from coordination metrics, cross-validated across task domains, enabling prediction on unseen task domains. We identify three effects, including: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Multi-Agent Code Verification with Compound Vulnerability Detection [0.0]
Existing tools only catch 65% of bugs, with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs.
arXiv Detail & Related papers (2025-11-20T03:40:27Z) - VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z) - Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents [10.378745306569053]
VulTrial is a courtroom-inspired framework designed to enhance automated vulnerability detection. It employs four role-specific agents: security researcher, code author, moderator, and review board. Using GPT-3.5 and GPT-4o, VulTrial improves performance by 102.39% and 84.17% over its respective baselines.
arXiv Detail & Related papers (2025-05-16T07:54:10Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning [79.43940012723539]
ADCLR is a self-supervised learning framework for learning accurate and dense vision representation.
Our approach achieves new state-of-the-art performance for contrastive methods.
arXiv Detail & Related papers (2023-06-23T07:38:09Z) - G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering [116.4277292854053]
Federated Learning (FL) offers collaborative model training without data sharing.
FL is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity.
We present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem.
arXiv Detail & Related papers (2023-06-08T07:15:04Z)