Fugu-MT Paper Translation (Summary): Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Paper summary: Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arxiv url: http://arxiv.org/abs/2605.19196v1
Date: Mon, 18 May 2026 23:55:08 GMT
Status: Translation complete
System update date: 2026-05-20 15:03:09.0326
Title: Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
Authors: Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan
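Abstract summary: This paper introduces REFLECT, a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes and instantiates them through controlled, localized interventions. In the experiments, even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures.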
Reference score (independently computed attention score): 61.49434544687523
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.
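The abstract describes building verifiable test instances by applying controlled, localized interventions to agent execution traces and then measuring how often a judge detects the injected failure. A minimal sketch of that idea follows. It is not the authors' implementation: the `Step`, `Instance`, `inject_failure`, `judge_accuracy`, and `keyword_judge` names, the failure-mode labels, and the toy trace are all hypothetical, and a simple keyword heuristic stands in for an LLM judge.

```python
import copy
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    kind: str      # hypothetical step types: "reasoning", "tool_call", "report"
    content: str


@dataclass
class Instance:
    trace: List[Step]
    injected_at: Optional[int]    # index of the corrupted step, None if the trace is clean
    failure_mode: Optional[str]   # hypothetical label, not the paper's taxonomy


def inject_failure(trace: List[Step], index: int, corrupted: str,
                   failure_mode: str) -> Instance:
    """Copy the trace and overwrite exactly one step (a localized intervention)."""
    perturbed = copy.deepcopy(trace)
    perturbed[index] = Step(perturbed[index].kind, corrupted)
    return Instance(perturbed, index, failure_mode)


# A judge maps a trace to the index of the step it flags, or None for "no failure".
Judge = Callable[[List[Step]], Optional[int]]


def judge_accuracy(judge: Judge, instances: List[Instance]) -> float:
    """Fraction of instances where the judge's verdict matches the known injection."""
    correct = sum(judge(inst.trace) == inst.injected_at for inst in instances)
    return correct / len(instances)


def keyword_judge(trace: List[Step]) -> Optional[int]:
    """Toy stand-in for an LLM judge: flags the first step containing 'unverified'."""
    for i, step in enumerate(trace):
        if "unverified" in step.content:
            return i
    return None


clean = [
    Step("reasoning", "Plan: look up the boiling point of water."),
    Step("tool_call", "search('boiling point of water at sea level')"),
    Step("report", "Water boils at 100 C at sea level [source 1]."),
]

instances = [
    Instance(clean, None, None),  # unmodified control trace
    inject_failure(clean, 2, "Water boils at 90 C at sea level (unverified).",
                   "unsupported_claim"),
    inject_failure(clean, 1, "search('melting point of iron')",
                   "irrelevant_tool_call"),
]

accuracy = judge_accuracy(keyword_judge, instances)
print(f"judge accuracy: {accuracy:.2f}")
```

Because each instance records where the failure was injected, the ground truth is known by construction, so judge accuracy can be computed without subjective human-preference labels. In this toy setup the keyword judge misses the corrupted tool call, which mirrors in miniature the kind of missed detection the benchmark is meant to surface.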

Related papers