EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
- URL: http://arxiv.org/abs/2506.08136v2
- Date: Fri, 03 Oct 2025 05:54:30 GMT
- Title: EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
- Authors: Zefang Liu, Yinzhu Quan
- Abstract summary: EconWebArena is a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy.
- Score: 1.0026496861838445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.
Related papers
- Promoting Sustainable Web Agents: Benchmarking and Estimating Energy Consumption through Empirical and Theoretical Analysis [9.631189259234931]
We show how different philosophies in web agent creation can severely impact the associated expended energy. We highlight a lack of transparency regarding disclosing model parameters and processes used for some web agents as a limiting factor when estimating energy consumption.
arXiv Detail & Related papers (2025-11-06T15:59:59Z)
- WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality [62.43165871914528]
We introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias.
arXiv Detail & Related papers (2025-10-21T12:16:04Z)
- WebDS: An End-to-End Benchmark for Web-based Data Science [59.270670758607494]
WebDS is the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites. WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
arXiv Detail & Related papers (2025-08-02T06:39:59Z)
- Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence [109.32705135051486]
Embodied Web Agents is a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. We release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks. Results reveal significant performance gaps between state-of-the-art AI systems and human capabilities.
arXiv Detail & Related papers (2025-06-18T17:58:17Z)
- REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites [9.58858258192147]
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation.
arXiv Detail & Related papers (2025-04-15T18:22:55Z)
- An Illusion of Progress? Assessing the Current State of Web Agents [49.76769323750729]
We conduct a comprehensive and rigorous assessment of the current state of web agents. Results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites.
arXiv Detail & Related papers (2025-04-02T05:51:29Z)
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.