VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
- URL: http://arxiv.org/abs/2408.06327v1
- Date: Mon, 12 Aug 2024 17:44:17 GMT
- Title: VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
- Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang,
- Abstract summary: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents.
Existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments.
VisualAgentBench (VAB) is a pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents.
- Score: 50.12414817737912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB constructs a trajectory training set constructed through hybrid methods including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, promoting substantial performance improvements in LMMs through behavior cloning. Our work not only aims to benchmark existing models but also provides a solid foundation for future development into visual foundation agents. Code, train \& test data, and part of fine-tuned open LMMs are available at \url{https://github.com/THUDM/VisualAgentBench}.
Related papers
- GUI Agents with Foundation Models: A Comprehensive Survey [52.991688542729385]
This survey consolidates recent research on (M)LLM-based GUI agents.
We highlight key innovations in data, frameworks, and applications.
We hope this paper will inspire further developments in the field of (M)LLM-based GUI agents.
arXiv Detail & Related papers (2024-11-07T17:28:10Z) - HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks [25.959032350818795]
HumanEval-V is a benchmark designed to evaluate Large Language Models' visual understanding and reasoning capabilities through code generation.
HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow.
We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges.
arXiv Detail & Related papers (2024-10-16T09:04:57Z) - AgentSquare: Automatic LLM Agent Search in Modular Design Space [16.659969168343082]
Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks.
We introduce a new research problem: Modularized LLM Agent Search (MoLAS)
arXiv Detail & Related papers (2024-10-08T15:52:42Z) - xGen-MM (BLIP-3): A Family of Open Large Multimodal Models [157.44696790158784]
This report introduces xGen-MM, a framework for developing Large Multimodal Models (LMMs)
The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.
Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks.
arXiv Detail & Related papers (2024-08-16T17:57:01Z) - Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z) - AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z) - Multitask Adaptation by Retrospective Exploration with Learned World
Models [77.34726150561087]
We propose a meta-learned addressing model called RAMa that provides training samples for the MBRL agent taken from task-agnostic storage.
The model is trained to maximize the expected agent's performance by selecting promising trajectories solving prior tasks from the storage.
arXiv Detail & Related papers (2021-10-25T20:02:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.