Fugu-MT Paper Translation (Abstract): Agent-as-a-Judge

Paper overview: Agent-as-a-Judge

arxiv url: http://arxiv.org/abs/2601.05111v1
Date: Thu, 08 Jan 2026 16:58:10 GMT
Status: Translation complete
Last updated in system: 2026-01-09 17:01:53.292597
Title: Agent-as-a-Judge
Title (reference translation): エージェント・アズ・ア・ジャッジ (Agent-as-a-Judge)
Authors: Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li
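Abstract summary: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessment. As the subjects of evaluation become more complex, specialized, and multi-step, the reliability of LLM-as-a-Judge is constrained by inherent biases, shallow single-pass reasoning, and the lack of verification against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, in which agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluation.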
Reference score (independently computed attention score): 20.902198303020693
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
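Abstract (reference translation): LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessment. However, as the subjects of evaluation become more complex, specialized, and multi-step, the reliability of LLM-as-a-Judge is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, in which agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluation. Despite the rapid spread of agentic evaluation systems, a unified framework for navigating this shifting landscape is lacking. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify the key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.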

Related papers