Tur[k]ingBench: A Challenge Benchmark for Web Agents
- URL: http://arxiv.org/abs/2403.11905v2
- Date: Thu, 21 Mar 2024 21:57:13 GMT
- Title: Tur[k]ingBench: A Challenge Benchmark for Web Agents
- Authors: Kevin Xu, Yeganeh Kordi, Kate Sanders, Yizhong Wang, Adam Byerly, Jack Zhang, Benjamin Van Durme, Daniel Khashabi
- Abstract summary: TurkingBench is a benchmark of tasks formulated as web pages containing textual instructions with multi-modal context.
This benchmark contains 32.2K instances distributed across 158 tasks.
We evaluate the performance of state-of-the-art models, including language-only, vision-only, and layout-only models, on this benchmark.
- Score: 43.23043474694926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent chatbots have demonstrated impressive ability to understand and communicate in raw-text form. However, there is more to the world than raw text. For example, humans spend long hours of their time on web pages, where text is intertwined with other modalities and tasks are accomplished in the form of various complex interactions. Can state-of-the-art multi-modal models generalize to such complex domains? To address this question, we introduce TurkingBench, a benchmark of tasks formulated as web pages containing textual instructions with multi-modal context. Unlike existing work which employs artificially synthesized web pages, here we use natural HTML pages that were originally designed for crowdsourcing workers for various annotation purposes. The HTML instructions of each task are also instantiated with various values (obtained from the crowdsourcing tasks) to form new instances of the task. This benchmark contains 32.2K instances distributed across 158 tasks. Additionally, to facilitate the evaluation on TurkingBench, we develop an evaluation framework that connects the responses of chatbots to modifications on web pages (modifying a text box, checking a radio, etc.). We evaluate the performance of state-of-the-art models, including language-only, vision-only, and layout-only models, and their combinations, on this benchmark. Our findings reveal that these models perform significantly better than random chance, yet considerable room exists for improvement. We hope this benchmark will help facilitate the evaluation and development of web-based agents.
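The abstract describes an evaluation framework that maps a chatbot's responses onto concrete web-page modifications (filling a text box, checking a radio button, and so on). A minimal sketch of that idea is below; the field names, the `{field: value}` answer format, and the exact-match scoring rule are illustrative assumptions, not TurkingBench's actual interface.

```python
# Hypothetical sketch: apply a chatbot's answers to a page's input fields,
# then score the resulting page state against gold annotations.
# Field schema, answer format, and scoring are assumptions for illustration.

def apply_answers(fields, answers):
    """Simulate page modifications: set text boxes, select radio
    options, and tick checkboxes according to the model's answers."""
    state = {}
    for name, field in fields.items():
        value = answers.get(name)
        if field["type"] == "text":
            state[name] = value or ""
        elif field["type"] == "radio":
            # A radio group accepts exactly one of its listed options.
            state[name] = value if value in field["options"] else None
        elif field["type"] == "checkbox":
            state[name] = bool(value)
    return state

def score(state, gold):
    """Exact-match accuracy over fields (one simple scoring choice)."""
    correct = sum(1 for k, v in gold.items() if state.get(k) == v)
    return correct / len(gold)

fields = {
    "q1": {"type": "text"},
    "q2": {"type": "radio", "options": ["yes", "no"]},
    "q3": {"type": "checkbox"},
}
answers = {"q1": "a cat", "q2": "yes", "q3": True}
gold = {"q1": "a cat", "q2": "yes", "q3": False}

state = apply_answers(fields, answers)
print(score(state, gold))  # two of three fields match gold
```

The same pattern extends to richer widgets (dropdowns, sliders) by adding field types, which is presumably why grounding responses in page modifications, rather than free text, makes the evaluation automatic.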
Related papers
- WebQuest: A Benchmark for Multimodal QA on Web Page Sequences [10.008284460456107]
WebQuest is a multi-page question-answering dataset that requires reasoning across multiple web pages.
Our dataset evaluates information extraction, multimodal retrieval and composition of information from many web pages.
We evaluate leading proprietary multimodal models like GPT-4V, Gemini Flash, Claude 3, and open source models like InstructBLIP, PaliGemma on our dataset.
arXiv Detail & Related papers (2024-09-06T18:44:25Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- MMInA: Benchmarking Multihop Multimodal Internet Agents [36.173995299002]
We present MMInA, a multihop and multimodal benchmark to evaluate embodied agents on compositional Internet tasks.
Our data includes 1,050 human-written tasks covering various domains such as shopping and travel.
Our method significantly improved both the single-hop and multihop web browsing abilities of agents.
arXiv Detail & Related papers (2024-04-15T17:59:50Z)
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks.
Evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.