Agent-as-a-Judge
- URL: http://arxiv.org/abs/2601.05111v1
- Date: Thu, 08 Jan 2026 16:58:10 GMT
- Title: Agent-as-a-Judge
- Authors: Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li
- Abstract summary: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. As evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, nuanced evaluations.
- Score: 20.902198303020693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
Related papers
- The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis [24.51410516475904]
  This SoK presents a comprehensive overview of the Prompt Injection (PI) landscape, covering attacks, defenses, and their evaluation practices. We introduce AgentPI, a new benchmark designed to systematically evaluate agent behavior under context-dependent interaction settings. We show that many defenses appear effective under existing benchmarks by suppressing contextual inputs, yet fail to generalize to realistic agent settings where context-dependent reasoning is essential.
  Date: 2026-02-11T02:47:10Z
- MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems [59.20800753428596]
  We present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models). We find that process-level verification does not consistently improve performance and frequently exhibits high variance.
  Date: 2026-02-03T03:30:36Z
- From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
  This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers. This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
  Date: 2026-01-22T06:21:31Z
- Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm [60.36837655498119]
  We propose a Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) framework. This framework takes an original task from an existing benchmark and encourages agents to evolve it into a new task with higher difficulty. Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently enhances task complexity while improving the reliability of correctness.
  Date: 2025-10-01T01:52:52Z
- When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs [8.575522204707958]
  As large language models (LLMs) grow in capability and autonomy, evaluating their outputs, especially in open-ended and complex tasks, has become a critical bottleneck. A new paradigm is emerging: using AI agents as the evaluators themselves. In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings.
  Date: 2025-08-05T01:42:25Z
- Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
  This survey probes the core challenges that the rise of Large Language Models poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety. We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
  Date: 2025-04-26T07:48:52Z
- Survey on Evaluation of LLM-based Agents [28.91672694491855]
  The emergence of LLM-based agents represents a paradigm shift in AI. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents.
  Date: 2025-03-20T17:59:23Z
- Survey on AI-Generated Media Detection: From Non-MLLM to MLLM [51.91311158085973]
  Methods for detecting AI-generated media have evolved rapidly. General-purpose detectors based on MLLMs integrate authenticity verification, explainability, and localization capabilities. Ethical and security considerations have emerged as critical global concerns.
  Date: 2025-02-07T12:18:20Z
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
  We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
  Date: 2024-01-24T01:51:00Z
This list is automatically generated from the titles and abstracts of the papers on this site.