Agent-as-a-Judge: Evaluate Agents with Agents
- URL: http://arxiv.org/abs/2410.10934v2
- Date: Wed, 16 Oct 2024 17:54:12 GMT
- Title: Agent-as-a-Judge: Evaluate Agents with Agents
- Authors: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber,
- Abstract summary: We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
- Score: 61.33974108405561
- License:
- Abstract: Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems -- by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.
Related papers
- AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents [4.57755315319748]
Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision-making.
These frameworks predominantly serve developers with extensive technical expertise.
Only 0.03 % of the global population possesses the necessary programming skills.
arXiv Detail & Related papers (2025-02-09T16:53:56Z) - The AI Agent Index [8.48525754659057]
Agentic AI systems can plan and execute complex tasks with limited human involvement.
There is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems.
The AI Agent Index is the first public database to document information about currently deployed agentic AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:13Z) - Free Agent in Agent-Based Mixture-of-Experts Generative AI Framework [0.0]
Reinforcement Learning Free Agent (RLFA) algorithm introduces a reward-based mechanism to detect and remove agents exhibiting persistent underperformance.
A primary use case is fraud detection, where RLFA promptly swaps out an agent whose detection accuracy dips below a preset threshold.
This dynamic, free-agency cycle ensures sustained accuracy, quicker adaptation to emerging threats, and minimal disruption to ongoing operations.
arXiv Detail & Related papers (2025-01-29T13:00:22Z) - Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement [117.94654815220404]
G"odel Agent is a self-evolving framework inspired by the G"odel machine.
G"odel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
arXiv Detail & Related papers (2024-10-06T10:49:40Z) - Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems [1.079505444748609]
We present our work on building a novel web agent, Agent-E.
Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents.
We show that Agent-E beats other SOTA text and multi-modal web agents on this benchmark in most categories by 10-30%.
arXiv Detail & Related papers (2024-07-17T21:44:28Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.
We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.
We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z) - Tell Me More! Towards Implicit User Intention Understanding of Language
Model Driven Agents [110.25679611755962]
Current language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions.
We introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users' implicit intentions through explicit queries.
We empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires user intentions, and refines them into actionable goals.
arXiv Detail & Related papers (2024-02-14T14:36:30Z) - The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI)
We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents.
We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.