Survey on Evaluation of LLM-based Agents
- URL: http://arxiv.org/abs/2503.16416v1
- Date: Thu, 20 Mar 2025 17:59:23 GMT
- Title: Survey on Evaluation of LLM-based Agents
- Authors: Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
- Abstract summary: The emergence of LLM-based agents represents a paradigm shift in AI. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents.
- Score: 28.91672694491855
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.
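For intuition only, dimension (1) can be made concrete as a small harness that runs an agent over capability-tagged tasks and aggregates a score per capability. The sketch below is a generic illustration under assumed interfaces (the Task schema, the agent callable, and the scoring rule are hypothetical), not a framework proposed in the survey.

```python
# Minimal sketch of a capability-tagged agent evaluation harness.
# The Task schema, the agent callable, and the scoring rule are illustrative
# assumptions, not an interface defined by the surveyed benchmarks.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    prompt: str
    capability: str                      # e.g. "planning", "tool_use", "memory"
    expected_tool_calls: List[str] = field(default_factory=list)
    reference_answer: str = ""


@dataclass
class Trajectory:
    tool_calls: List[str]
    final_answer: str


def score(task: Task, traj: Trajectory) -> float:
    """Score one episode: half credit for issuing the expected tool calls,
    half for matching the reference answer (deliberately simplistic)."""
    tool_ok = all(t in traj.tool_calls for t in task.expected_tool_calls)
    answer_ok = task.reference_answer.lower() in traj.final_answer.lower()
    return 0.5 * tool_ok + 0.5 * answer_ok


def evaluate(agent: Callable[[str], Trajectory], tasks: List[Task]) -> Dict[str, float]:
    """Aggregate scores per capability so strengths and weaknesses are visible."""
    per_capability: Dict[str, List[float]] = {}
    for task in tasks:
        per_capability.setdefault(task.capability, []).append(score(task, agent(task.prompt)))
    return {cap: sum(vals) / len(vals) for cap, vals in per_capability.items()}
```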
Related papers
- Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
This survey probes the core challenges that the rise of Large Language Models poses for evaluation.
We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety.
We dissect this issue, along with the core challenges of these two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
arXiv Detail & Related papers (2025-04-26T07:48:52Z)
- A Survey on (M)LLM-Based GUI Agents [62.57899977018417]
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction.
Recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms.
This survey identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control.
arXiv Detail & Related papers (2025-03-27T17:58:31Z)
- AI Agents: Evolution, Architecture, and Real-World Applications [0.0]
The paper examines the evolution, architecture, and practical applications of AI agents from their early, rule-based incarnations to modern sophisticated systems that integrate large language models with dedicated modules for perception, planning, and tool use.
arXiv Detail & Related papers (2025-03-16T23:07:48Z)
- Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems [1.415098516077151]
The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges in observing, analyzing, and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the non-deterministic, context-sensitive, and dynamic nature of these systems. This paper explores key challenges and opportunities in analyzing and optimizing agentic systems across development, testing, and maintenance.
arXiv Detail & Related papers (2025-03-09T20:02:04Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process. We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning. TestAgent enables automatic dynamic benchmark generation and in-depth assessment across diverse vertical domain scenarios.
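The abstract does not detail TestAgent's internals, so the sketch below only illustrates what a retrieval-grounded, adaptive evaluation loop could look like in general. The retriever, question generator, grader, and the crude difficulty-adjustment rule (standing in for a learned reinforcement-learning policy) are all hypothetical components.

```python
# Illustrative dynamic-evaluation loop: retrieve domain context, generate a
# question at the current difficulty, grade the answer against criteria, and
# adapt difficulty. The components are stubs; TestAgent's actual design may differ.
from typing import Callable, List


def dynamic_eval(
    retrieve: Callable[[str], List[str]],                 # domain -> supporting passages
    generate_question: Callable[[List[str], int], str],   # passages, difficulty -> question
    grade: Callable[[str, str], float],                   # question, answer -> score in [0, 1]
    model_answer: Callable[[str], str],                   # question -> model's answer
    domain: str,
    rounds: int = 10,
) -> List[float]:
    scores, difficulty = [], 1
    for _ in range(rounds):
        passages = retrieve(domain)
        question = generate_question(passages, difficulty)
        s = grade(question, model_answer(question))
        scores.append(s)
        # Crude stand-in for an RL policy: raise difficulty when the model
        # succeeds, lower it when it fails, to keep questions informative.
        difficulty = min(5, difficulty + 1) if s > 0.7 else max(1, difficulty - 1)
    return scores
```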
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skills.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
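As a reading aid, a fine-grained progress metric can be thought of as the share of annotated subgoals an agent has satisfied, rather than a binary task-success flag. The sketch below is an assumed formulation for illustration; it is not claimed to be AgentBoard's exact definition.

```python
# Illustrative subgoal-based progress rate: the share of annotated subgoals an
# agent has satisfied, averaged over episodes. This is an assumed formulation,
# not necessarily how AgentBoard computes its metric.
from typing import Callable, List, Sequence


def progress_rate(subgoal_checks: Sequence[Callable[[dict], bool]], state: dict) -> float:
    """Fraction of subgoal predicates satisfied by the episode's end state."""
    if not subgoal_checks:
        return 0.0
    return sum(check(state) for check in subgoal_checks) / len(subgoal_checks)


def average_progress(episodes: List[dict], subgoal_checks: Sequence[Callable[[dict], bool]]) -> float:
    """Average progress over a set of episode end states."""
    return sum(progress_rate(subgoal_checks, e) for e in episodes) / max(len(episodes), 1)
```

Unlike a plain success rate, such a metric distinguishes an agent that completes three of four subgoals from one that completes none.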
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
- Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
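One standard psychometric tool for construct-oriented measurement is item response theory; the sketch below fits a simple one-parameter (Rasch) model by gradient ascent to estimate a latent ability from per-item pass/fail outcomes. It is a textbook illustration, not a procedure the paper necessarily prescribes, and the item difficulties in the example are made-up numbers.

```python
# Textbook Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty).
# Estimating a latent ability from pass/fail item results is one concrete form
# of construct-oriented (rather than task-oriented) evaluation.
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def estimate_ability(difficulties: List[float], correct: List[bool],
                     lr: float = 0.1, steps: int = 500) -> float:
    """Maximum-likelihood ability estimate via gradient ascent on the Rasch log-likelihood."""
    theta = 0.0
    for _ in range(steps):
        # d(log-likelihood)/d(theta) = sum_i (y_i - P_i)
        grad = sum(y - sigmoid(theta - b) for y, b in zip(correct, difficulties))
        theta += lr * grad
    return theta


# Example: a model that clears the easier items but fails the harder ones.
print(estimate_ability([-1.0, 0.0, 1.0, 2.0], [True, True, False, False]))
```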
arXiv Detail & Related papers (2023-10-25T05:38:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.