XUAT-Copilot: Multi-Agent Collaborative System for Automated User
Acceptance Testing with Large Language Model
- URL: http://arxiv.org/abs/2401.02705v2
- Date: Wed, 10 Jan 2024 12:08:44 GMT
- Title: XUAT-Copilot: Multi-Agent Collaborative System for Automated User
Acceptance Testing with Large Language Model
- Authors: Zhitao Wang, Wei Wang, Zirao Li, Long Wang, Can Yi, Xinjie Xu, Luyang
Cao, Hanjing Su, Shouzhi Chen, Jun Zhou
- Abstract summary: We propose an LLM-powered multi-agent collaborative system, named XUAT-Copilot, for automated UAT.
The proposed system mainly consists of three LLM-based agents responsible for action planning, state checking and parameter selecting, respectively, and two additional modules for state sensing and case rewriting.
The system achieves effectiveness close to that of human testers in our experimental studies and a significant improvement in Pass@1 accuracy compared with a single-agent architecture.
- Score: 9.05375318147931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In past years, we have been dedicated to automating the user acceptance
testing (UAT) process of WeChat Pay, one of the most influential mobile payment
applications in China. A system titled XUAT has been developed for this
purpose. However, one stage of the current system remains human-labor-intensive:
test script generation. In this paper, we therefore concentrate on methods for
raising the automation level of the current system, particularly at the test
script generation stage. With their recent notable successes, large language
models (LLMs) have demonstrated significant potential for attaining human-like
intelligence, and a growing body of research employs LLMs as autonomous agents
to obtain human-like decision-making capabilities. Inspired by these works, we
propose an LLM-powered multi-agent collaborative system, named XUAT-Copilot,
for automated UAT. The proposed system mainly consists of three LLM-based
agents responsible for action planning, state checking and parameter selecting,
respectively, and two additional modules for state sensing and case rewriting.
The agents interact with the testing device, make human-like decisions, and
generate action commands in a collaborative way. The proposed multi-agent
system achieves effectiveness close to that of human testers in our
experimental studies and a significant improvement in Pass@1 accuracy over a
single-agent architecture. More importantly, the proposed system has been
launched in the formal testing environment of the WeChat Pay mobile app,
saving a considerable amount of manpower in daily development work.
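To make the described architecture concrete, below is a minimal Python sketch of how three agents (action planning, state checking, parameter selecting) and a state-sensing module might collaborate in a test loop. All class names, prompts, and the control flow here are illustrative assumptions based only on the abstract, not the paper's actual implementation.

# Illustrative sketch of the agent loop described in the abstract.
# All names (UIState, ActionPlanner, ...) and the control flow are
# assumptions for illustration; the paper's real implementation differs.

from dataclasses import dataclass


@dataclass
class UIState:
    """Simplified device state captured by the state-sensing module."""
    screen_text: str
    widgets: list

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; wire up a real backend here."""
    raise NotImplementedError

class ActionPlanner:
    """LLM agent that proposes the next test action for the current step."""
    def plan(self, step: str, state: UIState) -> str:
        return call_llm(f"Test step: {step}\nScreen: {state.screen_text}\n"
                        "Propose one UI action (e.g., 'tap <widget>').")

class ParameterSelector:
    """LLM agent that fills in concrete parameters for a planned action."""
    def select(self, action: str, state: UIState) -> str:
        return call_llm(f"Action: {action}\nWidgets: {state.widgets}\n"
                        "Choose the target widget and any input values.")

class StateChecker:
    """LLM agent that verifies the action produced the expected state."""
    def passed(self, step: str, state: UIState) -> bool:
        verdict = call_llm(f"Expected outcome of '{step}'.\n"
                           f"Screen: {state.screen_text}\nAnswer PASS or FAIL.")
        return verdict.strip().upper().startswith("PASS")

def run_case(steps, sense, execute) -> bool:
    """Drive one rewritten test case through the collaborating agents.

    `sense()` returns the current UIState (state-sensing module) and
    `execute(command)` sends an action command to the testing device.
    """
    planner, selector, checker = ActionPlanner(), ParameterSelector(), StateChecker()
    for step in steps:
        state = sense()
        action = planner.plan(step, state)        # action planning
        command = selector.select(action, state)  # parameter selecting
        execute(command)                          # act on the device
        if not checker.passed(step, sense()):     # state checking
            return False                          # case fails at this step
    return True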
Related papers
- Autonomous Microscopy Experiments through Large Language Model Agents [4.241267255764773]
Large language models (LLMs) have accelerated the development of self-driving laboratories (SDLs) for materials research.
Here, we introduce AILA (Artificially Intelligent Lab Assistant), a framework that automates atomic force microscopy (AFM) through LLM-driven agents.
Our systematic assessment shows that state-of-the-art language models struggle even with basic tasks such as documentation retrieval.
arXiv Detail & Related papers (2024-12-18T09:35:28Z) - ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding [9.921932789361732]
We present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios.
ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities coupled with their corresponding instructions.
arXiv Detail & Related papers (2024-10-29T16:39:28Z) - SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
SPA-Bench offers three key contributions: A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines.
A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption.
arXiv Detail & Related papers (2024-10-19T17:28:48Z) - Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance [95.03771007780976]
We tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions.
First, we collect real-world human activities to generate proactive task predictions.
These predictions are labeled by human annotators as either accepted or rejected.
The labeled data is used to train a reward model that simulates human judgment.
arXiv Detail & Related papers (2024-10-16T08:24:09Z) - Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z) - AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous agent and a semi-autonomous one supporting human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems [80.69865295743149]
This work studies the use of LLM-based agents to autonomously design collaborative AI systems.
Based on ComfyBench, we develop ComfyAgent, a framework that empowers agents to autonomously design collaborative AI systems by generating workflows.
While ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, it has resolved only 15% of creative tasks.
arXiv Detail & Related papers (2024-09-02T17:44:10Z) - EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms [55.77492625524141]
EvoAgent is a generic method to automatically extend specialized agents to multi-agent systems.
We show that EvoAgent can significantly enhance the task-solving capability of LLM-based agents.
arXiv Detail & Related papers (2024-06-20T11:49:23Z) - A Large Language Model-based multi-agent manufacturing system for intelligent shopfloor [10.776483342326904]
A Large Language Model (LLM)-based multi-agent manufacturing system for an intelligent shopfloor is proposed in this study.
This system delineates the diverse agents and defines their methods.
The negotiation between BAs and BIA is the most crucial step in connecting manufacturing resources.
arXiv Detail & Related papers (2024-05-27T07:10:04Z) - Benchmarking Mobile Device Control Agents across Diverse Configurations [19.01954948183538]
B-MoCA is a benchmark for evaluating and developing mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs.
While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness.
arXiv Detail & Related papers (2024-04-25T14:56:32Z) - SpeechAgents: Human-Communication Simulation with Multi-Modal
Multi-Agent Systems [53.94772445896213]
Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society.
We propose SpeechAgents, a multi-modal LLM based multi-agent system designed for simulating human communication.
arXiv Detail & Related papers (2024-01-08T15:01:08Z) - ProAgent: From Robotic Process Automation to Agentic Process Automation [87.0555252338361]
Large Language Models (LLMs) have demonstrated emergent human-like intelligence.
This paper introduces Agentic Process Automation (APA), a groundbreaking automation paradigm using LLM-based agents for advanced automation.
We then instantiate ProAgent, an agent designed to craft workflows from human instructions and make intricate decisions by coordinating specialized agents.
arXiv Detail & Related papers (2023-11-02T14:32:16Z) - A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration [55.35849138235116]
We propose automatically selecting a team of agents from candidates to collaborate in a dynamic communication structure toward different tasks and domains.
Specifically, we build a framework named Dynamic LLM-Powered Agent Network (DyLAN) for LLM-powered agent collaboration.
We demonstrate that DyLAN outperforms strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost.
arXiv Detail & Related papers (2023-10-03T16:05:48Z) - User Simulation with Large Language Models for Evaluating Task-Oriented
Dialogue [10.336443286833145]
We propose a novel user simulator built using recently developed large pretrained language models (LLMs).
Unlike previous work, which sought to maximize goal success rate (GSR) as the primary metric of simulator performance, our goal is a system which achieves a GSR similar to that observed in human interactions with TOD systems.
arXiv Detail & Related papers (2023-09-23T02:04:57Z)