AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
- URL: http://arxiv.org/abs/2401.13178v1
- Date: Wed, 24 Jan 2024 01:51:00 GMT
- Title: AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
- Authors: Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui
Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He
- Abstract summary: Evaluating large language models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications.
We introduce AgentBoard, a pioneering comprehensive benchmark and an accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents.
- Score: 76.95062553043607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating large language models (LLMs) as general-purpose agents is
essential for understanding their capabilities and facilitating their
integration into practical applications. However, the evaluation process
presents substantial challenges. A primary obstacle is the benchmarking of
agent performance across diverse scenarios within a unified framework,
especially in maintaining partially-observable environments and ensuring
multi-round interactions. Moreover, current evaluation frameworks mostly focus
on the final success rate, revealing few insights during the process and
failing to provide a deep understanding of the models' abilities. To address
these challenges, we introduce AgentBoard, a pioneering comprehensive benchmark
and an accompanying open-source evaluation framework tailored to the analytical
evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric
that captures incremental advancements as well as a comprehensive evaluation
toolkit that features easy assessment of agents for multi-faceted analysis
through interactive visualization. This not only sheds light on the
capabilities and limitations of LLM agents but also propels the
interpretability of their performance to the forefront. Ultimately, AgentBoard
serves as a significant step towards demystifying agent behaviors and
accelerating the development of stronger LLM agents.
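To make the distinction the abstract draws between a final success rate and a fine-grained progress rate concrete, below is a minimal sketch in Python. It assumes a hypothetical agent/env interface and a list of subgoal predicates; it is an illustration of the general idea, not AgentBoard's actual toolkit API or its exact per-task progress definition.

from dataclasses import dataclass

@dataclass
class EpisodeResult:
    success: bool          # final success (0/1), the usual headline metric
    progress_rate: float   # fraction of subgoals ever satisfied, in [0, 1]

def evaluate_episode(agent, env, subgoals, max_turns=30):
    """Run one multi-turn episode and score both final success and progress.

    Assumed (hypothetical) interface: env.reset()/env.step(action) return a
    partial observation plus a done flag, and each subgoal is a predicate
    over the underlying environment state.
    """
    obs = env.reset()
    achieved = set()
    for _ in range(max_turns):
        action = agent.act(obs)          # agent only sees partial observations
        obs, done = env.step(action)
        # Progress is monotone: once a subgoal is met it stays counted, so
        # partial accomplishment remains visible even if the episode fails.
        achieved |= {i for i, goal in enumerate(subgoals) if goal(env.state)}
        if done:
            break
    progress = len(achieved) / max(len(subgoals), 1)
    return EpisodeResult(success=len(achieved) == len(subgoals),
                         progress_rate=progress)

Averaging progress_rate across episodes separates an agent that consistently completes most subgoals from one that fails immediately, even when both report a near-zero success rate.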
Related papers
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.
We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
- Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability [1.3812010983144802]
Surrogate Optimization (SO) is a common resolution, yet its proprietary nature leads to a lack of explainability and transparency.
We propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO).
These metrics enhance the transparency, trustworthiness, and explainability of the SO approaches.
arXiv Detail & Related papers (2024-10-18T16:20:17Z)
- Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process.
We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
- RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language Models [5.0741409008225755]
Large language models (LLMs) have emerged as promising tools for solving challenging robotic tasks.
Most existing LLM-based agents lack the ability to retain and learn from past interactions.
We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents' decisions.
arXiv Detail & Related papers (2024-09-18T20:03:32Z)
- Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
The Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training.
Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)
- Large Multimodal Agents: A Survey [78.81459893884737]
Large language models (LLMs) have achieved superior performance in powering text-based AI agents.
There is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain.
This review aims to provide valuable insights and guidelines for future research in this rapidly evolving field.
arXiv Detail & Related papers (2024-02-23T06:04:23Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)