Fugu-MT Paper Translation (Abstract): Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

Paper Abstract: Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

arxiv url: http://arxiv.org/abs/2603.06942v1
Date: Fri, 06 Mar 2026 23:30:27 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.417464
Title: Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
Title (reference translation): Deep Research, Shallow Evaluation: A Case Study of Meta-Evaluation in Long-Form QA Benchmarks
Authors: Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D'Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, Sergey Feldman
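Abstract summary: We conduct a case study in meta-evaluation for long-form QA benchmarks using ScholarQA-CS2. We validate the benchmark through human pairwise preference judgments and critically examine the strengths, weaknesses, and confounders of this approach. We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment.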
Reference score (independently calculated attention score): 40.91183014128371
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods. Many of the meta-evaluations estimate an evaluation's quality by comparing its assessments against human pairwise preferences. Prior work, however, suggests that human pairwise preference may be overly simplistic and can fail to capture nuances of expert expectations. We conduct a case study in meta-evaluation for long-form QA benchmarks using ScholarQA-CS2, a benchmark designed for assessing retrieval-augmented deep-research QA in the scientific domain. We comprehensively validate the benchmark through human pairwise preference judgments, then critically examine the strengths, weaknesses, and confounders of this approach. We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key challenge. Based on our findings, we offer practical guidelines for designing future meta-evaluations that better align evaluation methods, annotator expertise, and reporting practices. By surfacing these methodological challenges, we aim to advance evaluation standards for deep-research systems.
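Abstract (reference translation): Recent advances have made long-form report-generating systems widely available. This has led to evaluation frameworks that use LLM-as-judge protocols and claim verification, as well as meta-evaluation frameworks that validate these methods. Many meta-evaluations estimate the quality of an evaluation by comparing its assessments with human pairwise preferences. However, prior work suggests that human pairwise preferences may be overly simplistic and can fail to capture the nuances of expert expectations. We conduct a case study of meta-evaluation for long-form QA benchmarks using ScholarQA-CS2. We comprehensively validate the benchmark through human pairwise preference judgments and critically examine the strengths, weaknesses, and confounders of this approach. We show that pairwise preference rankings are best suited to system-level evaluation, whereas explicit metric-level annotations and expert annotators are important for reliable metric-level assessment, and subjectivity remains a key challenge. Based on these findings, we provide practical guidelines for designing future meta-evaluations that better align evaluation methods, annotator expertise, and reporting practices. By surfacing these methodological challenges, we aim to advance evaluation standards for deep-research systems.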

Related papers