Tur[k]ingBench: A Challenge Benchmark for Web Agents
- URL: http://arxiv.org/abs/2403.11905v3
- Date: Sun, 1 Sep 2024 07:13:30 GMT
- Title: Tur[k]ingBench: A Challenge Benchmark for Web Agents
- Authors: Kevin Xu, Yeganeh Kordi, Tanay Nayak, Ado Asija, Yizhong Wang, Kate Sanders, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi
- Abstract summary: TurkingBench is a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts.
This benchmark includes 32.2K instances spread across 158 tasks.
We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.
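The evaluation framework described in the abstract links free-text chatbot responses to concrete actions on a web page. Below is a minimal sketch of what such a mapping could look like; the action grammar, field names, and regular expression are illustrative assumptions for this page, not the benchmark's actual interface.

```python
import re
from dataclasses import dataclass

# Hypothetical action grammar for illustration only; the real TurkingBench
# framework's response format and action set may differ.
@dataclass
class PageAction:
    kind: str    # "fill_text" or "select_radio"
    target: str  # id of the HTML element being acted on
    value: str   # text to enter, or the radio option to select

# Assumed response format, e.g.:
#   fill_text(q1_answer, "a web agent benchmark")
#   select_radio(q2_choice, "yes")
_ACTION_RE = re.compile(r'(fill_text|select_radio)\((\w+),\s*"([^"]*)"\)')

def parse_actions(response: str) -> list[PageAction]:
    """Extract structured page actions from a model's free-text response."""
    return [PageAction(kind, target, value)
            for kind, target, value in _ACTION_RE.findall(response)]

if __name__ == "__main__":
    reply = 'fill_text(q1_answer, "32.2K instances") select_radio(q2_choice, "yes")'
    for action in parse_actions(reply):
        print(action)
```

In a full evaluation loop, each parsed action would then be applied to the live HTML page (e.g., writing into a text box or checking a radio button) before scoring the resulting annotations.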
Related papers
- WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
WebQuest is a multi-page question-answering dataset that requires reasoning across multiple web pages.
Our dataset evaluates information extraction, multimodal retrieval and composition of information from many web pages.
We evaluate leading proprietary multimodal models such as GPT-4V, Gemini Flash, and Claude 3, as well as open-source models such as InstructBLIP and PaliGemma, on our dataset.
arXiv Detail & Related papers (2024-09-06T18:44:25Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
Experimental results demonstrate that a single agent backed by GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- MMInA: Benchmarking Multihop Multimodal Internet Agents
We present MMInA, a multihop and multimodal benchmark for evaluating embodied agents on compositional Internet tasks.
Our data includes 1,050 human-written tasks covering various domains such as shopping and travel.
Our method significantly improves both the single-hop and multihop web browsing abilities of agents.
arXiv Detail & Related papers (2024-04-15T17:59:50Z)
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks.
However, evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate that this recipe improves the agent's grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)