Fugu-MT Paper Translation (Abstract): ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

Paper summary: ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

arxiv url: http://arxiv.org/abs/2509.00496v1
Date: Sat, 30 Aug 2025 13:37:28 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.257671
Title: ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
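Title (reference translation): ResearchQA: Evaluating Scholarly Question Answering Across 75 Fields
Authors: Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar
Abstract summary: ResearchQA is a resource for evaluating LLM systems that distills survey articles from 75 research fields into 21K queries and 160K rubric items. Assessments by 31 Ph.D. annotators in 8 fields indicate that 96% of queries support Ph.D. information needs. We leverage ResearchQA to analyze competency gaps in 18 systems in over 7.6K pairwise evaluations.
Reference score (independently computed attention score): 11.916911713137518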
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating long-form responses to research queries heavily relies on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet, research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with queries from survey sections, lists query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate 96% of queries support Ph.D. information needs and 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we are able to construct an automatic pairwise judge obtaining 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems in over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking agentic system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
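Abstract (reference translation): Evaluating long-form responses to research queries relies heavily on expert annotators, which restricts attention to fields such as AI where researchers can conveniently enlist colleagues. Yet research expertise is widespread, and survey articles synthesize knowledge scattered across the literature. We introduce ResearchQA, a resource for evaluating LLM systems that distills survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric is derived jointly with queries from survey sections and lists query-specific answer evaluation criteria, namely citing papers, making explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate that 96% of queries support Ph.D. information needs and that 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we can construct an automatic pairwise judge that reaches 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems in over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% coverage of rubric items, and the highest-ranking agentic system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.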

Related papers