Autonomous Evaluation and Refinement of Digital Agents
- URL: http://arxiv.org/abs/2404.06474v3
- Date: Mon, 07 Oct 2024 14:19:37 GMT
- Title: Autonomous Evaluation and Refinement of Digital Agents
- Authors: Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
- Abstract summary: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control.
We validate the performance of these models on several popular benchmarks for digital agents, finding between 74.4% and 92.9% agreement with oracle evaluation metrics.
- Score: 57.12281122337407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models on several popular benchmarks for digital agents, finding between 74.4% and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve around 75% relative improvement in device control settings.
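As a concrete illustration of the inference-time guidance described above, here is a minimal sketch in which an agent proposes several candidate trajectories and a domain-general automatic evaluator selects the best one. The `propose` and `evaluate` callables are hypothetical placeholders (e.g., agent rollouts and a VLM/LLM judge), not the paper's actual interfaces.

```python
from typing import Callable, Sequence


def best_of_n(
    propose: Callable[[str], Sequence[str]],   # instruction -> N candidate trajectories
    evaluate: Callable[[str, str], float],     # (instruction, trajectory) -> success score
    instruction: str,
) -> str:
    """Return the candidate trajectory the automatic evaluator rates highest."""
    candidates = propose(instruction)
    scored = [(evaluate(instruction, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

The same evaluator scores could also serve as reward labels for the fine-tuning use case mentioned in the abstract, for example by keeping only trajectories the evaluator marks as successful.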
Related papers
- An End-to-End Multi-objective Ensemble Ranking Framework for Video Recommendation [20.59012057446529]
We propose a novel End-to-end Multi-objective Ensemble Ranking framework (EMER) for the multi-objective ensemble ranking module. EMER replaces manually designed formulas with an end-to-end modeling paradigm. Our framework has been deployed in the primary scenarios of Kuaishou, a short-video recommendation platform with hundreds of millions of daily active users.
arXiv Detail & Related papers (2025-08-07T07:21:46Z) - The Effects of Grouped Structural Global Pruning of Vision Transformers on Domain Generalisation [2.2124795371148616]
This paper introduces a novel grouped structural pruning method for pre-trained vision transformers (ViT, BeiT, and DeiT).
Our method uses dependency graph analysis to identify and remove redundant groups of neurons, weights, filters, or attention heads within transformers.
Results show significant improvements in inference speed and fine-tuning time with minimal trade-offs in accuracy and DG task performance.
arXiv Detail & Related papers (2025-04-05T15:05:36Z) - Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier; a minimal sketch of such a verifier-guided loop appears after this list.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - An Illusion of Progress? Assessing the Current State of Web Agents [49.76769323750729]
We conduct a comprehensive and rigorous assessment of the current state of web agents.
Results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results.
We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites.
arXiv Detail & Related papers (2025-04-02T05:51:29Z) - VideoGen-Eval: Agent-based System for Video Generation Evaluation [54.662739174367836]
Advances in video generation have rendered existing evaluation systems inadequate for assessing state-of-the-art models.
We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions.
We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
arXiv Detail & Related papers (2025-03-30T14:12:21Z) - AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents [5.515875179998062]
AutoEval is an autonomous agent evaluation framework that tests a mobile agent without any manual effort.
We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage of human-annotated reward signals.
We evaluate the state-of-the-art mobile agents using our framework, providing detailed insights into their performance characteristics and limitations.
arXiv Detail & Related papers (2025-03-04T08:44:30Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.
We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.
We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving [17.27549891731047]
We improve the reliability of agent behaviors by closed-loop fine-tuning of behavior models with reinforcement learning.
Our method demonstrates improved overall performance, as well as improved targeted metrics such as collision rate.
We present a novel policy evaluation benchmark to directly assess the ability of simulated agents to measure the quality of autonomous vehicle planners.
arXiv Detail & Related papers (2024-09-26T23:40:33Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the performance of the Llama 2 model by up to 15%.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Domain Adaptation of Transformer-Based Models using Unlabeled Data for Relevance and Polarity Classification of German Customer Feedback [1.2999413717930817]
This work explores how efficient transformer-based models are when working with a German customer feedback dataset.
The experimental results show that transformer-based models can reach significant improvements compared to a fastText baseline.
arXiv Detail & Related papers (2022-12-12T08:32:28Z) - On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets [101.28658250723804]
This paper experiments with augmenting a transformer model with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action.
We observe that the proposed modules resulted in improved, and in fact state-of-the-art performance on an unseen validation set of a popular benchmark dataset, ALFRED.
We highlight this result as we believe it may be a wider phenomenon in machine learning tasks but primarily noticeable only in benchmarks that limit evaluations on test splits.
arXiv Detail & Related papers (2022-05-18T23:52:21Z) - DAPPER: Label-Free Performance Estimation after Personalization for Heterogeneous Mobile Sensing [95.18236298557721]
We present DAPPER (Domain AdaPtation Performance EstimatoR) that estimates the adaptation performance in a target domain with unlabeled target data.
Our evaluation with four real-world sensing datasets compared against six baselines shows that DAPPER outperforms the state-of-the-art baseline by 39.8% in estimation accuracy.
arXiv Detail & Related papers (2021-11-22T08:49:33Z) - Control Distance IoU and Control Distance IoU Loss Function for Better Bounding Box Regression [11.916482804759479]
We first present an evaluation-feedback module, which consists of an evaluation system and a feedback mechanism.
Finally, we focus on both the evaluation system and the feedback mechanism, and propose the Control Distance IoU metric and the Control Distance IoU loss function.
arXiv Detail & Related papers (2021-03-22T09:57:25Z) - Our Evaluation Metric Needs an Update to Encourage Generalization [24.6240575061124]
Models that surpass human performance on popular benchmarks display significant performance degradation when exposed to out-of-distribution (OOD) data.
We propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.
arXiv Detail & Related papers (2020-07-14T08:15:19Z)
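As referenced in the "Review, Refine, Repeat" (IAD) entry above, a verifier-guided decoding loop can be sketched as follows. The `generate`, `verify`, and `refine` callables are hypothetical placeholders rather than that paper's actual API; the loop simply keeps refining a candidate with the verifier's critique until the score clears a threshold or the round budget runs out.

```python
from typing import Callable, Tuple


def iterative_decode(
    generate: Callable[[str], str],                  # task -> initial candidate
    verify: Callable[[str, str], Tuple[float, str]], # (task, candidate) -> (score, critique)
    refine: Callable[[str, str, str], str],          # (task, candidate, critique) -> new candidate
    task: str,
    max_rounds: int = 4,
    threshold: float = 0.9,
) -> str:
    """Generate, verify, and refine a candidate until the verifier is satisfied."""
    candidate = generate(task)
    for _ in range(max_rounds):
        score, critique = verify(task, candidate)
        if score >= threshold:
            break
        candidate = refine(task, candidate, critique)
    return candidate
```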
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.