Autonomous Evaluation and Refinement of Digital Agents
- URL: http://arxiv.org/abs/2404.06474v2
- Date: Wed, 10 Apr 2024 04:55:54 GMT
- Title: Autonomous Evaluation and Refinement of Digital Agents
- Authors: Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
- Abstract summary: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control.
We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics.
- Score: 57.12281122337407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve a 75% relative improvement in a challenging domain transfer scenario.
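As a rough sketch of what inference-time guidance with such an evaluator can look like, the snippet below retries an agent until an automatic evaluator accepts an attempt, keeping the best-scoring trajectory. The `agent` and `evaluator` callables are hypothetical placeholders; this is a generic Reflexion-style recipe consistent with the abstract, not the paper's exact implementation.

```python
# Minimal sketch of evaluator-guided inference: retry the agent until the
# automatic evaluator judges an attempt successful. `agent` and `evaluator`
# are hypothetical interfaces, not the paper's actual components.
from typing import Callable, List

def run_with_evaluator(
    task: str,
    agent: Callable[[str, List[str]], List[str]],   # returns a trajectory of actions
    evaluator: Callable[[str, List[str]], float],   # returns estimated success in [0, 1]
    max_attempts: int = 3,
    threshold: float = 0.5,
) -> List[str]:
    feedback: List[str] = []
    best_traj: List[str] = []
    best_score = float("-inf")
    for _ in range(max_attempts):
        traj = agent(task, feedback)
        score = evaluator(task, traj)
        if score > best_score:
            best_traj, best_score = traj, score
        if score >= threshold:          # evaluator accepts: stop early
            break
        # crude reflection signal for the next attempt
        feedback.append("previous attempt judged unsuccessful")
    return best_traj
```

The fine-tuning route mentioned in the abstract can reuse the same evaluator as a filter: keep only trajectories it scores as successful and train on those, which requires no additional supervision.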
Related papers
- CORE-BEHRT: A Carefully Optimized and Rigorously Evaluated BEHRT [1.825224193230824]
We introduce CORE-BEHRT, a Carefully Optimized and Rigorously Evaluated BEHRT.
We show that improving data representation can increase the average downstream performance from 0.785 to 0.797 AUROC.
We observed significant performance increases in 17 out of 25 tasks and improvements in 24 of the 25, highlighting the generalizability of our findings.
arXiv Detail & Related papers (2024-04-23T16:35:59Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
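The summary above only names the ingredients (an LLM reasoner plus a flexible linear programming solver), so the toy below illustrates just the assignment-LP flavor: match evaluation examples to candidate skills under per-skill capacity limits. The affinity matrix and capacities are made up, and this is an illustration of the general idea, not QualEval's actual formulation.

```python
# Toy assignment LP: route each evaluation example to one skill, capping how
# many examples a skill may claim. Values here are invented for illustration.
import numpy as np
from scipy.optimize import linprog

affinity = np.array([          # affinity[i, j]: how well example i exhibits skill j
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.5, 0.6, 0.7],
])
n_examples, n_skills = affinity.shape
capacity = 2                    # each skill may claim at most 2 examples

c = -affinity.ravel()           # linprog minimizes, so negate to maximize affinity
# Equality: each example is assigned to exactly one skill (sum_j x[i, j] == 1).
A_eq = np.zeros((n_examples, n_examples * n_skills))
for i in range(n_examples):
    A_eq[i, i * n_skills:(i + 1) * n_skills] = 1
b_eq = np.ones(n_examples)
# Inequality: per-skill capacity (sum_i x[i, j] <= capacity).
A_ub = np.zeros((n_skills, n_examples * n_skills))
for j in range(n_skills):
    A_ub[j, j::n_skills] = 1
b_ub = np.full(n_skills, capacity)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.reshape(n_examples, n_skills).round(2))  # example-to-skill assignment
```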
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly [79.07074710460012]
The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention.
An increasing number of transfer-based methods have been developed to fool black-box DNN models.
We establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods.
arXiv Detail & Related papers (2023-11-02T15:35:58Z)
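For readers unfamiliar with the attack family such a benchmark compares, the sketch below shows the simplest transfer-based attack: craft an FGSM perturbation on a white-box surrogate model and check whether it also flips a separate black-box target model. The two linear models and the random input are toy stand-ins, not TA-Bench components.

```python
# Transfer-based attack in miniature: FGSM on a surrogate, tested on a target.
import torch
import torch.nn as nn

surrogate = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # white-box
target = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))     # black-box

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # stand-in input image
y = torch.tensor([3])                             # its true label
eps = 8 / 255                                     # perturbation budget

# FGSM on the surrogate: one signed-gradient step that increases the loss.
loss = nn.functional.cross_entropy(surrogate(x), y)
loss.backward()
x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()

# Transferability check: query the target only on clean and adversarial inputs.
with torch.no_grad():
    transferred = target(x_adv).argmax(1) != target(x).argmax(1)
print("prediction flipped on the target model:", bool(transferred))
```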
- Summarization from Leaderboards to Practice: Choosing A Representation Backbone and Ensuring Robustness [21.567112955050582]
In both automatic and human evaluation, BART performs better than PEG and T5.
We find considerable variation in system output that can be captured only with human evaluation.
arXiv Detail & Related papers (2023-06-18T13:35:41Z)
- GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprising six classification and six segmentation tasks.
This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
arXiv Detail & Related papers (2023-06-06T16:16:05Z)
- Domain Adaptation of Transformer-Based Models using Unlabeled Data for Relevance and Polarity Classification of German Customer Feedback [1.2999413717930817]
This work explores how efficient transformer-based models are when applied to a German customer feedback dataset.
The experimental results show that transformer-based models achieve significant improvements over a fastText baseline.
arXiv Detail & Related papers (2022-12-12T08:32:28Z)
- DAPPER: Label-Free Performance Estimation after Personalization for Heterogeneous Mobile Sensing [95.18236298557721]
We present DAPPER (Domain AdaPtation Performance EstimatoR) that estimates the adaptation performance in a target domain with unlabeled target data.
Our evaluation with four real-world sensing datasets compared against six baselines shows that DAPPER outperforms the state-of-the-art baseline by 39.8% in estimation accuracy.
arXiv Detail & Related papers (2021-11-22T08:49:33Z)
- Control Distance IoU and Control Distance IoU Loss Function for Better Bounding Box Regression [11.916482804759479]
We first present an evaluation-feedback module consisting of an evaluation system and a feedback mechanism.
Finally, focusing on both the evaluation system and the feedback mechanism, we propose the Control Distance IoU metric and the Control Distance IoU loss function.
arXiv Detail & Related papers (2021-03-22T09:57:25Z)
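The Control Distance IoU formula itself is not reproduced here; the sketch below shows only the standard IoU computation and the plain 1 - IoU loss that such bounding-box regression variants build on, for axis-aligned boxes in (x1, y1, x2, y2) format.

```python
# Standard IoU and the plain 1 - IoU loss for batches of axis-aligned boxes.
import torch

def iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    lt = torch.max(box_a[:, :2], box_b[:, :2])   # top-left of intersection
    rb = torch.min(box_a[:, 2:], box_b[:, 2:])   # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)                  # zero width/height if no overlap
    inter = wh[:, 0] * wh[:, 1]
    area_a = (box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])
    area_b = (box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])
    return inter / (area_a + area_b - inter + 1e-7)

def iou_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    return (1.0 - iou(pred, gt)).mean()
```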
This list is automatically generated from the titles and abstracts of the papers on this site.