Multi-Agent LLM Committees for Autonomous Software Beta Testing
- URL: http://arxiv.org/abs/2512.21352v1
- Date: Sun, 21 Dec 2025 02:06:53 GMT
- Title: Multi-Agent LLM Committees for Autonomous Software Beta Testing
- Authors: Sumanth Bharadwaj Hachalli Karanam, Dhiwahar Adhithya Kennady,
- Abstract summary: The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success. The framework enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manual software beta testing is costly and time-consuming, while single-agent large language model (LLM) approaches suffer from hallucinations and inconsistent behavior. We propose a multi-agent committee framework in which diverse vision-enabled LLMs collaborate through a three-round voting protocol to reach consensus on testing actions. The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding to systematically explore web applications. Across 84 experimental runs with 9 testing personas and 4 scenarios, multi-agent committees achieve an 89.5 percent overall task success rate. Configurations with 2 to 4 agents reach 91.7 to 100 percent success, compared to 78.0 percent for single-agent baselines, yielding improvements of 13.7 to 22.0 percentage points. At the action level, the system attains a 93.1 percent success rate with a median per-action latency of 0.71 seconds, enabling real-time and continuous integration testing. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success and form filling achieving 99.2 percent success. We evaluate the framework on WebShop and OWASP benchmarks, achieving 74.7 percent success on WebShop compared to a 50.1 percent published GPT-3 baseline, and 82.0 percent success on OWASP Juice Shop security testing with coverage of 8 of the 10 OWASP Top 10 vulnerability categories. Across 20 injected regressions, the committee achieves an F1 score of 0.91 for bug detection, compared to 0.78 for single-agent baselines. The open-source implementation enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
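The abstract describes, but does not specify, the three-round voting protocol by which committee agents converge on a testing action. As a rough illustration only (the function, quorum value, and action names below are hypothetical, not the paper's protocol), a majority-vote loop over per-round proposals might look like this:

```python
from collections import Counter

def committee_vote(rounds, quorum=0.5):
    """Toy consensus: return the first action that wins a strict
    majority in any voting round, else the last round's plurality pick."""
    for proposals in rounds:
        action, votes = Counter(proposals).most_common(1)[0]
        if votes / len(proposals) > quorum:
            return action
    return action

# Round 1 splits three ways; round 2 converges on "click_submit".
print(committee_vote([
    ["click_submit", "fill_form", "navigate_back"],
    ["click_submit", "click_submit", "navigate_back"],
]))  # click_submit
```

In a real deployment each proposal would come from a different vision-enabled LLM observing the same page state, so the vote filters out single-model hallucinations.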
Related papers
- Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development [6.072381417546439]
Existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications with 964 browser-based substeps. Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
arXiv Detail & Related papers (2026-03-04T21:00:33Z)
- MultiVer: Zero-Shot Multi-Agent Vulnerability Detection [0.0]
MultiVer is a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points.
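Union voting can be sketched in a few lines: a sample is flagged as vulnerable if any agent flags it, which maximizes recall at some cost in precision. The data below is made up for illustration; it is not from the paper:

```python
def union_vote(agent_flags):
    """Flag a sample if ANY agent flagged it (union of predictions)."""
    return [any(flags) for flags in zip(*agent_flags)]

def recall(pred, truth):
    """Fraction of truly vulnerable samples that were flagged."""
    tp = sum(p and t for p, t in zip(pred, truth))
    return tp / sum(truth)

# Four agents, five code samples; each agent catches different bugs.
agents = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
truth = [1, 1, 1, 1, 0]
pred = union_vote(agents)   # [True, True, True, True, False]
print(recall(pred, truth))  # 1.0
```

The same idea generalizes: intersection voting would instead trade recall for precision.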
arXiv Detail & Related papers (2026-02-19T22:20:17Z)
- Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance [4.424336158797069]
This paper compares five popular AI-powered coding assistants (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code). Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks). Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates.
arXiv Detail & Related papers (2026-02-09T17:14:46Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties.<n>We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains.<n>We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debug framework for large language model (LLM)-based multi-agent systems.<n>It augments hypothesis generation with active verification through targeted interventions.<n>DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z) - CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent [46.41047559759938]
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces.<n> Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored.<n>We present CUARewardBench, comprising four key contributions.
arXiv Detail & Related papers (2025-10-21T12:53:40Z) - MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models [10.977990951788422]
We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium.<n> MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty.<n>Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%)
arXiv Detail & Related papers (2025-10-05T21:15:11Z) - SCUBA: Salesforce Computer Use Benchmark [63.66753028386581]
SCUBA is a benchmark designed to evaluate computer-use agents on customer relationship management ( CRM) within the Salesforce platform.<n> SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas, platform administrators, sales representatives, and service agents.<n>We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings.
arXiv Detail & Related papers (2025-09-30T16:48:49Z) - Multi-Agent Penetration Testing AI for the Web [3.93181912653522]
MAPTA is a multi-agent system for autonomous web application security assessment.<n>It combines large language model orchestration with tool-grounded execution and end-to-end exploit validation.<n>On the 104-challenge XBOW benchmark, MAPTA achieves 76.9% overall success.
arXiv Detail & Related papers (2025-08-28T14:14:24Z) - SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication [19.633176635669397]
We present SafeSieve, a progressive and adaptive multi-agent pruning algorithm.<n>We show that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%.<n>These results establish SafeSieve as a robust, efficient, and scalable framework for practical multi-agent systems.
arXiv Detail & Related papers (2025-08-15T13:44:50Z) - VAULT: Vigilant Adversarial Updates via LLM-Driven Retrieval-Augmented Generation for NLI [15.320553375828045]
VAULT is a fully automated adversarial RAG pipeline that uncovers and remedies weaknesses in NLI models.<n>VAULT consistently outperforms prior in-context adversarial methods by up to 2.0% across datasets.
arXiv Detail & Related papers (2025-08-01T14:22:54Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions.<n>Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions.<n>We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
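The SOP-to-graph idea can be made concrete with a minimal sketch: each procedure step becomes a function node that may only run after its prerequisites. All names below are hypothetical illustrations, not SOPBench's actual schema:

```python
# Hypothetical SOP: verify identity before issuing a refund.
SOP_GRAPH = {
    "verify_identity": [],
    "check_order":     ["verify_identity"],
    "issue_refund":    ["check_order"],
}

def call(step, done=None):
    """Execute a step only if all its prerequisite steps were done."""
    done = done if done is not None else set()
    if any(pre not in done for pre in SOP_GRAPH[step]):
        return False, done  # constraint violated, step rejected
    done.add(step)
    return True, done

ok, done = call("verify_identity")
ok, done = call("check_order", done)
ok, done = call("issue_refund", done)
print(ok)  # True
```

An agent that jumps straight to `issue_refund` would be rejected, which is the kind of constraint violation such a benchmark can score automatically.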
arXiv Detail & Related papers (2025-03-11T17:53:02Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.<n>We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.<n>We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed papers and is not responsible for any consequences of their use.