Multi-Agent Code Verification with Compound Vulnerability Detection
- URL: http://arxiv.org/abs/2511.16708v1
- Date: Thu, 20 Nov 2025 03:40:27 GMT
- Title: Multi-Agent Code Verification with Compound Vulnerability Detection
- Authors: Shreshth Rajan
- Abstract summary: Existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs generate buggy code: 29.6% of SWE-bench "solved" patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, confirmed by measuring agent correlation of p = 0.05--0.25. We also show that multiple vulnerabilities in the same code create exponentially more risk than previously thought--SQL injection plus exposed credentials creates 15x more danger (risk 300 vs. 20) than traditional models predict. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method while running faster and without test execution. We tested 15 different agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with gains of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4. The best two-agent combination reaches 79.3% accuracy. Testing on 300 real patches from Claude Sonnet 4.5 runs in under 200ms per sample, making this practical for production use.
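The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration and is not taken from the paper: the function names `cumulative_accuracy` and `union_detection_rate` are hypothetical, the formula 1 - ∏(1 - p_i) is the textbook idealization for fully independent detectors (the abstract reports low but nonzero correlation, so it is an assumption here), and the per-agent rates passed to it are made up.

```python
from math import prod


def cumulative_accuracy(base, gains):
    """Running accuracy (in %) after adding each marginal gain in percentage points."""
    totals = [base]
    for gain in gains:
        totals.append(round(totals[-1] + gain, 1))
    return totals


def union_detection_rate(rates):
    """Chance that at least one detector fires, ASSUMING independent detectors."""
    return 1.0 - prod(1.0 - r for r in rates)


# Figures quoted in the abstract: single-agent baseline and per-agent gains.
steps = cumulative_accuracy(32.8, [14.9, 13.5, 11.2])
total_gain = round(steps[-1] - steps[0], 1)

# Compound risk (300) relative to the traditional estimate (20).
risk_ratio = 300 / 20

# Hypothetical per-agent detection rates, for illustration only.
example_union = union_detection_rate([0.5, 0.5])

print(steps)
print(total_gain)
print(risk_ratio)
print(example_union)
```

The stepwise gains add up to the reported 72.4% end point. Their sum is 39.6 points, so the 39.7 quoted in the abstract presumably comes from rounding unrounded values.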
Related papers
- MultiVer: Zero-Shot Multi-Agent Vulnerability Detection [0.0]
MultiVer is a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points.
arXiv Detail & Related papers (2026-02-19T22:20:17Z)
- When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents [0.0]
ReAct-style agents produce 2.0--4.2 distinct action sequences per 10 runs on average, even with identical inputs. Tasks with consistent behavior achieve 80--92% accuracy, while highly inconsistent tasks achieve only 25--60%. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.
arXiv Detail & Related papers (2026-02-12T06:15:14Z)
- Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks [0.0]
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. We conduct the first systematic testing and comparative evaluation of agentic AI systems. We identify six distinct defensive behavior patterns including a novel "hallucinated compliance" strategy.
arXiv Detail & Related papers (2025-12-16T19:22:50Z)
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z)
- Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
- VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
- BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis [1.9291502706655312]
We introduce BugGen, a first-of-its-kind, fully autonomous, multi-agent pipeline to generate, insert, and validate functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour, over five times faster than typical manual expert insertion.
arXiv Detail & Related papers (2025-06-12T09:02:20Z)
- BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems [62.17474934536671]
We introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1.
arXiv Detail & Related papers (2025-05-21T07:44:52Z)
- Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents [10.378745306569053]
VulTrial is a courtroom-inspired framework designed to enhance automated vulnerability detection. It employs four role-specific agents: security researcher, code author, moderator, and review board. Using GPT-3.5 and GPT-4o, VulTrial improves performance by 102.39% and 84.17% over its respective baselines.
arXiv Detail & Related papers (2025-05-16T07:54:10Z)
- Fine-Grained 1-Day Vulnerability Detection in Binaries via Patch Code Localization [12.73365645156957]
1-day vulnerabilities in binaries have become a major threat to software security. A patch presence test is one effective way to detect such vulnerabilities. We propose a novel approach named PLocator, which leverages stable values from both the patch code and its context.
arXiv Detail & Related papers (2025-01-29T04:35:37Z)
- Evaluating Agent-based Program Repair at Google [9.62742759337993]
Agent-based program repair offers to automatically resolve complex bugs end-to-end. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench. This paper explores the viability of using an agentic approach to address bugs in an enterprise context.
arXiv Detail & Related papers (2025-01-13T18:09:25Z)
- On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents. The impact of clumsy or even malicious agents--those who frequently make errors in their tasks--on the overall performance of the system remains underexplored. This paper investigates the resilience of various system structures under faulty agents on different downstream tasks.
arXiv Detail & Related papers (2024-08-02T03:25:20Z)
- Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception.
We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception.
We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
arXiv Detail & Related papers (2023-10-18T11:36:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.