CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
- URL: http://arxiv.org/abs/2407.01511v2
- Date: Fri, 18 Oct 2024 11:29:39 GMT
- Title: CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
- Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li,
- Abstract summary: Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
- Score: 49.68117560675367
- License:
- Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.
Related papers
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - $τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains [43.43344028212623]
$tau$-bench is a benchmark emulating dynamic conversations between a user and a language agent.
We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state.
arXiv Detail & Related papers (2024-06-17T19:33:08Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z) - Benchmarking Mobile Device Control Agents across Diverse Configurations [19.01954948183538]
B-MoCA is a benchmark for evaluating and developing mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs.
While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness.
arXiv Detail & Related papers (2024-04-25T14:56:32Z) - OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z) - AgentStudio: A Toolkit for Building General Virtual Agents [57.02375267926862]
General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments.
AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces.
It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos.
Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation.
arXiv Detail & Related papers (2024-03-26T17:54:15Z) - MAgIC: Investigation of Large Language Model Powered Multi-Agent in
Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.