Fugu-MT Paper Translation (Abstract): Formalized Information Needs Improve Large-Language-Model Relevance Judgments

Paper abstract: Formalized Information Needs Improve Large-Language-Model Relevance Judgments

arxiv url: http://arxiv.org/abs/2604.04140v1
Date: Sun, 05 Apr 2026 14:59:10 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.942085
Title: Formalized Information Needs Improve Large-Language-Model Relevance Judgments
Title (reference translation): Information Formalization That Improves Large Language Model Relevance Judgments
Authors: Jüri Keller, Maik Fröbe, Björn Engelmann, Fabian Haak, Timo Breuer, Birger Larsen, Philipp Schaer
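Abstract summary: Cranfield-style retrieval evaluations with too few or too many relevant documents, or with low inter-assessor agreement on relevance, can reduce the reliability of observations. In evaluations with human assessors, information needs are formalized as retrieval topics to avoid an excessive number of relevant documents. We synthetically formalize information needs with Large Language Models (LLMs) into topics that follow the established structure from previous human relevance assessments.
Reference score (independently computed attention score): 12.789247779450688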
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cranfield-style retrieval evaluations with too few or too many relevant documents or with low inter-assessor agreement on relevance can reduce the reliability of observations. In evaluations with human assessors, information needs are often formalized as retrieval topics to avoid an excessive number of relevant documents while maintaining good agreement. However, emerging evaluation setups that use Large Language Models (LLMs) as relevance assessors often use only queries, potentially decreasing the reliability. To study whether LLM relevance assessors benefit from formalized information needs, we synthetically formalize information needs with LLMs into topics that follow the established structure from previous human relevance assessments (i.e., descriptions and narratives). We compare assessors using synthetically formalized topics against the LLM-default query-only assessor on Robust04 and the 2019/2020 editions of TREC Deep Learning. We find that assessors without formalization judge many more documents relevant and have a lower agreement, leading to reduced reliability in retrieval evaluations. Furthermore, we show that the formalized topics improve agreement between human and LLM relevance judgments, even when the topics are not highly similar to their human counterparts. Our findings indicate that LLM relevance assessors should use formalized information needs, as is standard for human assessment, and synthetically formalize topics when no human formalization exists to improve evaluation reliability.
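Abstract (reference translation): Cranfield-style retrieval evaluations with too few or too many relevant documents, or with low inter-assessor agreement on relevance, can reduce the reliability of observations. In evaluations with human assessors, information needs are formalized as retrieval topics to avoid an excessive number of relevant documents while maintaining good agreement. However, emerging evaluation setups that use Large Language Models (LLMs) as relevance assessors often use only queries, which may decrease reliability. To examine whether LLM relevance assessors benefit from formalized information needs, we synthetically formalize information needs with LLMs into topics that follow the established structure of previous human relevance assessments (i.e., descriptions and narratives). We compare assessors using synthetically formalized topics against the LLM-default query-only assessor on Robust04 and the 2019/2020 editions of TREC Deep Learning. We find that assessors without formalization judge more documents relevant and show lower agreement, which reduces the reliability of retrieval evaluations. Furthermore, we show that formalized topics improve agreement between human and LLM relevance judgments. These results suggest that LLM relevance assessors should use formalized information needs, as is standard for human assessment, and that topics should be synthetically formalized when no human formalization exists, to improve reliability.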
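
The abstract compares assessors by their agreement with human relevance judgments but does not name the agreement statistic used. As an illustration only, the sketch below assumes Cohen's kappa over binary relevance labels. The function name `cohens_kappa` and all label lists are hypothetical and are not data or results from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of documents with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each assessor's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assessors used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up binary judgments (1 = relevant, 0 = not relevant) for 8 documents.
human = [1, 0, 0, 1, 0, 0, 1, 0]
llm_query_only = [1, 1, 1, 1, 0, 1, 1, 0]  # hypothetical: judges more documents relevant
llm_formalized = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical: closer to the human labels

print(cohens_kappa(human, llm_query_only))
print(cohens_kappa(human, llm_formalized))
```

In this toy setup the query-only labels mark six of eight documents relevant and agree less with the human labels than the formalized ones do, which mirrors the direction of the effect the abstract reports without reproducing any of its numbers.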

Related papers