AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents
- URL: http://arxiv.org/abs/2503.02403v1
- Date: Tue, 04 Mar 2025 08:44:30 GMT
- Title: AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents
- Authors: Jiahui Sun, Zhichao Hua, Yubin Xia
- Abstract summary: AutoEval is an autonomous agent evaluation framework that tests a mobile agent without any manual effort. We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage of human-annotated reward signals. We evaluate state-of-the-art mobile agents using our framework, providing detailed insights into their performance characteristics and limitations.
- Score: 5.515875179998062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and systematic evaluation of mobile agents can significantly advance their development and real-world applicability. However, existing benchmarks for mobile agents lack practicality and scalability due to the extensive manual effort required to define task reward signals and implement the corresponding evaluation code. To this end, we propose AutoEval, an autonomous agent evaluation framework that tests a mobile agent without any manual effort. First, we design a Structured Substate Representation to describe the UI state changes during agent execution, such that task reward signals can be generated automatically. Second, we utilize a Judge System that can autonomously evaluate agents' performance given the automatically generated task reward signals. Provided with only a task description, our framework evaluates agents with fine-grained performance feedback on that task without any extra manual effort. We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage of human-annotated reward signals. Moreover, to prove the effectiveness of our autonomous Judge System, we manually verify its judgments and demonstrate that it achieves 94% accuracy. Finally, we evaluate state-of-the-art mobile agents using our framework, providing detailed insights into their performance characteristics and limitations.
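The abstract's pipeline can be made concrete with a minimal sketch: represent each observable UI state change as a structured substate, derive a task reward signal as a set of expected substate changes, and have a judge score an agent's trace against that signal. All class and function names below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the two components described in the abstract:
# a Structured Substate Representation for UI state changes, and a Judge
# that scores an agent trajectory against an automatically derived reward
# signal. Names are illustrative, not taken from the paper's code.

@dataclass(frozen=True)
class Substate:
    """One observable UI state change, e.g. WifiToggle: 'off' -> 'on'."""
    element: str
    before: str
    after: str

@dataclass
class TaskRewardSignal:
    """Substate changes that a successful execution is expected to produce."""
    task: str
    expected: list[Substate] = field(default_factory=list)

def judge(signal: TaskRewardSignal, observed: list[Substate]) -> float:
    """Fine-grained score: fraction of expected substate changes observed."""
    if not signal.expected:
        return 0.0
    hits = sum(1 for s in signal.expected if s in observed)
    return hits / len(signal.expected)

# Toy example: the agent was asked to turn on Wi-Fi.
signal = TaskRewardSignal(
    task="Turn on Wi-Fi",
    expected=[Substate("WifiToggle", "off", "on")],
)
trace = [
    Substate("WifiToggle", "off", "on"),
    Substate("StatusBar", "no-wifi", "wifi"),  # incidental change, ignored
]
print(judge(signal, trace))  # 1.0
```

The fractional score, rather than a binary pass/fail, is what gives the framework its "fine-grained performance feedback": a partially completed task earns partial credit for each expected substate change it produced.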
Related papers
- A2Eval: Agentic and Automated Evaluation for Embodied Brain [26.357063836707223]
Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks. Agentic Automatic Evaluation (A2Eval) is the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup.
arXiv Detail & Related papers (2026-02-02T04:55:27Z) - Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation [4.08768677009363]
We propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively.
arXiv Detail & Related papers (2025-08-07T15:39:48Z) - Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z) - Test Automation for Interactive Scenarios via Promptable Traffic Simulation [48.240394447516664]
We introduce an automated method to generate realistic and safety-critical human behaviors for AV planner evaluation in interactive scenarios. We parameterize complex human behaviors using low-dimensional goal positions, which are then fed into a promptable traffic simulator, ProSim. To automate test generation, we introduce a prompt generation module that explores the goal domain and efficiently identifies safety-critical behaviors using Bayesian optimization.
arXiv Detail & Related papers (2025-06-01T22:29:32Z) - AutoLibra: Agent Metric Induction from Open-Ended Feedback [44.905607036805634]
AutoLibra is a framework for agent evaluation that transforms open-ended human feedback. We experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics. We show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate.
arXiv Detail & Related papers (2025-05-05T17:47:49Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling agentic reasoning systems at test time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations. This approach was originally developed for the TREC Question Answering (QA) Track in 2003. We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z) - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [59.214178488091584]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents.
Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks.
We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z) - AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World [45.70178627573973]
AutoEval is a system to autonomously evaluate robot policies around the clock with minimal human intervention. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process. We provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms.
arXiv Detail & Related papers (2025-03-31T16:23:44Z) - WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench.
WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks.
Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z) - Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z) - AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous and a semi-autonomous supporting human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - Multimodal Auto Validation For Self-Refinement in Web Agents [0.5843533603338313]
This paper introduces an approach to improving web agent performance through multi-modal validation and self-refinement.
We present a comprehensive study of different modalities (text, vision) and the effect of hierarchy for the automatic validation of web agents.
We also introduce a self-refinement mechanism for web automation, using the developed auto-validator, that enables web agents to detect and self-correct workflow failures.
arXiv Detail & Related papers (2024-10-01T13:43:55Z) - Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving [17.27549891731047]
We improve the reliability of agent behaviors by closed-loop fine-tuning of behavior models with reinforcement learning.
Our method demonstrates improved overall performance, as well as improved targeted metrics such as collision rate.
We present a novel policy evaluation benchmark to directly assess the ability of simulated agents to measure the quality of autonomous vehicle planners.
arXiv Detail & Related papers (2024-09-26T23:40:33Z) - Auditing an Automatic Grading Model with deep Reinforcement Learning [0.0]
We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model.
We show that a high level of agreement to human ratings does not give sufficient evidence that an ASAG model is infallible.
arXiv Detail & Related papers (2024-05-11T20:07:09Z) - Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting challenges through two key innovations.
A novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
An agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - Autonomous Evaluation and Refinement of Digital Agents [57.12281122337407]
We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control.
We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics.
arXiv Detail & Related papers (2024-04-09T17:25:47Z) - From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.