The BrowserGym Ecosystem for Web Agent Research
- URL: http://arxiv.org/abs/2412.05467v4
- Date: Fri, 28 Feb 2025 16:02:27 GMT
- Title: The BrowserGym Ecosystem for Web Agent Research
- Authors: Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste
- Abstract summary: The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
- Score: 151.90034093362343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
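For readers unfamiliar with the gym-like interface the abstract refers to, below is a minimal sketch of how such an environment is typically driven. It uses the standard gymnasium reset/step loop; the BrowserGym-specific environment id, import path, task kwargs, and string-based action format shown here are illustrative assumptions rather than a verbatim copy of the library's documented API.

```python
import gymnasium as gym
import browsergym.core  # noqa: F401 -- assumed import path; registers "browsergym/..." env ids

def trivial_agent(obs):
    """Placeholder policy: a real agent would prompt an LLM with the observation
    (e.g. DOM / accessibility-tree text, screenshot) and return an action string."""
    return "noop()"  # assumed no-op action in the high-level action space

env = gym.make(
    "browsergym/openended",                                  # assumed environment id
    task_kwargs={"start_url": "https://www.example.com"},    # assumed task kwargs
)

obs, info = env.reset()
done = False
while not done:
    action = trivial_agent(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```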
Related papers
- REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites [9.58858258192147]
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites.
We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions.
Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation.
arXiv Detail & Related papers (2025-04-15T18:22:55Z) - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [59.214178488091584]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents.
Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks.
We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z) - Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - WebGames: Challenging General-Purpose Web-Browsing AI Agents [11.320069795732058]
WebGames is a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents.
We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance.
Results reveal a substantial capability gap, with the best AI system achieving only 43.1% success rate compared to human performance of 95.7%.
arXiv Detail & Related papers (2025-02-25T16:45:08Z) - OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z) - SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [18.84439000902905]
SWE-Search is a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents' performance.
This work highlights the potential of self-evaluation driven search techniques to enhance agent reasoning and planning in complex, dynamic software engineering environments.
arXiv Detail & Related papers (2024-10-26T22:45:56Z) - Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two concepts: Benchmark+, which extends traditional question-answer benchmarks into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process.
We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous one and a semi-autonomous one supporting human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems [80.69865295743149]
This work attempts to study using LLM-based agents to design collaborative AI systems autonomously. Based on ComfyBench, we develop ComfyAgent, a framework that empowers agents to autonomously design collaborative AI systems by generating. While ComfyAgent achieves a comparable resolve rate to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks.
arXiv Detail & Related papers (2024-09-02T17:44:10Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both the GPT-4 (All Tools) and WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z) - WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.