AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
- URL: http://arxiv.org/abs/2401.13178v1
- Date: Wed, 24 Jan 2024 01:51:00 GMT
- Title: AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
- Authors: Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui
Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He
- Abstract summary: Evaluating large language models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications.
We introduce AgentBoard, a pioneering comprehensive benchmark and an accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents.
- Score: 76.95062553043607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating large language models (LLMs) as general-purpose agents is
essential for understanding their capabilities and facilitating their
integration into practical applications. However, the evaluation process
presents substantial challenges. A primary obstacle is the benchmarking of
agent performance across diverse scenarios within a unified framework,
especially in maintaining partially-observable environments and ensuring
multi-round interactions. Moreover, current evaluation frameworks mostly focus
on the final success rate, revealing few insights during the process and
failing to provide a deep understanding of model abilities. To address
these challenges, we introduce AgentBoard, a pioneering comprehensive benchmark
and an accompanying open-source evaluation framework tailored to the analytical
evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric
that captures incremental advancements as well as a comprehensive evaluation
toolkit that features easy assessment of agents for multi-faceted analysis
through interactive visualization. This not only sheds light on the
capabilities and limitations of LLM agents but also propels the
interpretability of their performance to the forefront. Ultimately, AgentBoard
serves as a significant step towards demystifying agent behaviors and
accelerating the development of stronger LLM agents.
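The difference between a final success rate and a fine-grained progress rate can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not AgentBoard's actual API: it assumes each episode is recorded as a list of boolean subgoal outcomes, and contrasts a binary success metric (all subgoals achieved) with a progress metric (average fraction of subgoals achieved).

```python
from typing import List

def success_rate(episodes: List[List[bool]]) -> float:
    """Binary metric: an episode counts only if every subgoal is achieved."""
    return sum(all(subgoals) for subgoals in episodes) / len(episodes)

def progress_rate(episodes: List[List[bool]]) -> float:
    """Fine-grained metric: average fraction of subgoals achieved per episode."""
    return sum(sum(subgoals) / len(subgoals) for subgoals in episodes) / len(episodes)

# Hypothetical subgoal outcomes for an agent that often gets partway through a task.
episodes = [
    [True, True, False, False],   # 2 of 4 subgoals reached
    [True, True, True, True],     # full success
    [True, False, False, False],  # 1 of 4 subgoals reached
]

print(f"success rate:  {success_rate(episodes):.2f}")   # 0.33
print(f"progress rate: {progress_rate(episodes):.2f}")  # 0.58
```

Under these assumptions, two agents with identical (even zero) success rates can still be distinguished by how far they typically advance through a task, which is the kind of incremental signal a progress-rate metric is meant to surface.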
Related papers
- KoMA: Knowledge-driven Multi-agent Framework for Autonomous Driving with Large Language Models [15.951550445568605]
Large language models (LLMs) acting as autonomous agents offer a novel avenue for tackling real-world challenges in a knowledge-driven manner.
We propose the KoMA framework consisting of multi-agent interaction, multi-step planning, shared-memory, and ranking-based reflection modules.
arXiv Detail & Related papers (2024-07-19T12:13:08Z)
- Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
The Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training.
Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)
- MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning [3.651416979200174]
MMCTAgent is a novel critical thinking agent framework designed to address the inherent limitations of current MLLMs in complex visual reasoning tasks.
Inspired by human cognitive processes and critical thinking, MMCTAgent iteratively analyzes multi-modal information, decomposes queries, plans strategies, and dynamically evolves its reasoning.
arXiv Detail & Related papers (2024-05-28T16:55:41Z)
- DEBATE: Devil's Advocate-Based Assessment and Text Evaluation [6.2689399557794525]
We propose DEBATE, an NLG evaluation framework based on a multi-agent scoring system.
Within the framework, one agent is instructed to criticize other agents' arguments.
We show that the extensiveness of debates among agents and the persona of an agent can influence the performance of evaluators.
arXiv Detail & Related papers (2024-05-16T09:41:12Z)
- OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following [38.99303334457817]
Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions.
Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in EIF.
We introduce OPEx, a comprehensive framework that delineates the core components essential for solving EIF tasks: Observer, Planner, and Executor.
arXiv Detail & Related papers (2024-03-05T14:53:53Z)
- Large Multimodal Agents: A Survey [78.81459893884737]
Large language models (LLMs) have achieved superior performance in powering text-based AI agents.
There is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain.
This review aims to provide valuable insights and guidelines for future research in this rapidly evolving field.
arXiv Detail & Related papers (2024-02-23T06:04:23Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.