AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents
- URL: http://arxiv.org/abs/2503.02403v1
- Date: Tue, 04 Mar 2025 08:44:30 GMT
- Title: AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents
- Authors: Jiahui Sun, Zhichao Hua, Yubin Xia
- Abstract summary: AutoEval is an autonomous agent evaluation framework that tests a mobile agent without any manual effort. We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage of human-annotated reward signals. We evaluate state-of-the-art mobile agents using our framework, providing detailed insights into their performance characteristics and limitations.
- Score: 5.515875179998062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and systematic evaluation of mobile agents can significantly advance their development and real-world applicability. However, existing benchmarks for mobile agents lack practicality and scalability due to the extensive manual effort required to define task reward signals and implement the corresponding evaluation code. To this end, we propose AutoEval, an autonomous agent evaluation framework that tests a mobile agent without any manual effort. First, we design a Structured Substate Representation to describe the UI state changes during agent execution, such that task reward signals can be generated automatically. Second, we utilize a Judge System that can autonomously evaluate agents' performance given the automatically generated task reward signals. Provided with only a task description, our framework evaluates agents with fine-grained performance feedback on that task without any extra manual effort. We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage of human-annotated reward signals. Moreover, to prove the effectiveness of our autonomous Judge System, we manually verify its judgments and demonstrate that it achieves 94% accuracy. Finally, we evaluate state-of-the-art mobile agents using our framework, providing detailed insights into their performance characteristics and limitations.
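The abstract's pipeline can be made concrete with a minimal sketch: represent each observable UI state change as a structured substate, derive a task reward signal as a set of expected substate changes, and have a judge score an agent's trace against that signal. All class and function names below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the two components described in the abstract:
# a Structured Substate Representation for UI state changes, and a Judge
# that scores an agent trajectory against an automatically derived reward
# signal. Names are illustrative, not taken from the paper's code.

@dataclass(frozen=True)
class Substate:
    """One observable UI state change, e.g. WifiToggle: 'off' -> 'on'."""
    element: str
    before: str
    after: str

@dataclass
class TaskRewardSignal:
    """Substate changes that a successful execution is expected to produce."""
    task: str
    expected: list[Substate] = field(default_factory=list)

def judge(signal: TaskRewardSignal, observed: list[Substate]) -> float:
    """Fine-grained score: fraction of expected substate changes observed."""
    if not signal.expected:
        return 0.0
    hits = sum(1 for s in signal.expected if s in observed)
    return hits / len(signal.expected)

# Toy example: the agent was asked to turn on Wi-Fi.
signal = TaskRewardSignal(
    task="Turn on Wi-Fi",
    expected=[Substate("WifiToggle", "off", "on")],
)
trace = [
    Substate("WifiToggle", "off", "on"),
    Substate("StatusBar", "no-wifi", "wifi"),  # incidental change, ignored
]
print(judge(signal, trace))  # 1.0
```

The fractional score, rather than a binary pass/fail, is what gives the framework its "fine-grained performance feedback": a partially completed task earns partial credit for each expected substate change it produced.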
Related papers
- A2Eval: Agentic and Automated Evaluation for Embodied Brain [26.357063836707223]
Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks. Agentic Automatic Evaluation (A2Eval) is the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup.
arXiv Detail & Related papers (2026-02-02T04:55:27Z) - Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation [4.08768677009363]
We propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively.
arXiv Detail & Related papers (2025-08-07T15:39:48Z) - Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z) - Test Automation for Interactive Scenarios via Promptable Traffic Simulation [48.240394447516664]
We introduce an automated method to generate realistic and safety-critical human behaviors for AV planner evaluation in interactive scenarios. We parameterize complex human behaviors using low-dimensional goal positions, which are then fed into a promptable traffic simulator, ProSim. To automate test generation, we introduce a prompt generation module that explores the goal domain and efficiently identifies safety-critical behaviors using Bayesian optimization.
arXiv Detail & Related papers (2025-06-01T22:29:32Z) - AutoLibra: Agent Metric Induction from Open-Ended Feedback [44.905607036805634]
AutoLibra is a framework for agent evaluation that transforms open-ended human feedback. We experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics. We show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate.
arXiv Detail & Related papers (2025-05-05T17:47:49Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling agentic reasoning systems at test time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations. This approach was originally developed for the TREC Question Answering (QA) Track in 2003. We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z) - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [59.214178488091584]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents.
Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks.
We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z) - AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World [45.70178627573973]
AutoEval is a system to autonomously evaluate robot policies around the clock with minimal human intervention. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process. We provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms.
arXiv Detail & Related papers (2025-03-31T16:23:44Z) - WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench.
WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks.
Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z) - Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z) - AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous and a semi-autonomous supporting human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - Multimodal Auto Validation For Self-Refinement in Web Agents [0.5843533603338313]
This paper introduces an approach to improving web agent performance through multi-modal validation and self-refinement.
We present a comprehensive study of different modalities (text, vision) and the effect of hierarchy for the automatic validation of web agents.
We also introduce a self-refinement mechanism for web automation, using the developed auto-validator, that enables web agents to detect and self-correct workflow failures.
arXiv Detail & Related papers (2024-10-01T13:43:55Z) - Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving [17.27549891731047]
We improve the reliability of agent behaviors by closed-loop fine-tuning of behavior models with reinforcement learning.
Our method demonstrates improved overall performance, as well as improved targeted metrics such as collision rate.
We present a novel policy evaluation benchmark to directly assess the ability of simulated agents to measure the quality of autonomous vehicle planners.
arXiv Detail & Related papers (2024-09-26T23:40:33Z) - Auditing an Automatic Grading Model with deep Reinforcement Learning [0.0]
We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model.
We show that a high level of agreement to human ratings does not give sufficient evidence that an ASAG model is infallible.
arXiv Detail & Related papers (2024-05-11T20:07:09Z) - Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting challenges through two key innovations.
A novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
An agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - Autonomous Evaluation and Refinement of Digital Agents [57.12281122337407]
We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control.
We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics.
arXiv Detail & Related papers (2024-04-09T17:25:47Z) - From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.