Fugu-MT Paper Translation (Summary): Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation?

Paper summary: Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation?

arxiv url: http://arxiv.org/abs/2511.23312v1
Date: Fri, 28 Nov 2025 16:10:39 GMT
Status: Translation complete
Last updated in system: 2025-12-01 19:47:55.975872
Title: Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation?
Authors: Gustavo Penha, Aleksandr V. Petrov, Claudia Hauff, Enrico Palumbo, Ali Vardasbi, Edoardo D'Amico, Francesco Fabbri, Alice Wang, Praveen Chandar, Henrik Lindstrom, Hugues Bouchard, Mounia Lalmas
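Abstract summary: This paper investigates whether Large Language Models (LLMs) can serve as reliable automatic judges to address scalability challenges. Using the ML-32M-ext Cranfield-style movie recommendation collection, it examines the limitations of existing evaluation methodologies. Incorporating richer item metadata and longer user histories improves alignment, and LLM-judge yields high agreement with human-based rankings.
Reference score (independently computed attention score): 40.49875426230813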
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Evaluating recommender systems remains a long-standing challenge, as offline methods based on historical user interactions and train-test splits often yield unstable and inconsistent results due to exposure bias, popularity bias, sampled evaluations, and missing-not-at-random patterns. In contrast, textual document retrieval benefits from robust, standardized evaluation via Cranfield-style test collections, which combine pooled relevance judgments with controlled setups. While recent work shows that adapting this methodology to recommender systems is feasible, constructing such collections remains costly due to the need for manual relevance judgments, thus limiting scalability. This paper investigates whether Large Language Models (LLMs) can serve as reliable automatic judges to address these scalability challenges. Using the ML-32M-ext Cranfield-style movie recommendation collection, we first examine the limitations of existing evaluation methodologies. Then we explore the alignment and the recommender systems ranking agreement between the LLM-judge and human provided relevance labels. We find that incorporating richer item metadata and longer user histories improves alignment, and that LLM-judge yields high agreement with human-based rankings (Kendall's tau = 0.87). Finally, an industrial case study in the podcast recommendation domain demonstrates the practical value of LLM-judge for model selection. Overall, our results show that LLM-judge is a viable and scalable approach for evaluating recommender systems.
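The ranking agreement reported in the abstract is a Kendall's tau between two orderings of recommender systems: one induced by human relevance labels and one induced by LLM-judge labels. A minimal sketch of that kind of comparison follows. The system names and metric scores are hypothetical values made up for illustration, not figures from the paper, and the tau-a variant used here is an assumption, since the abstract does not say which variant the authors computed.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: both lists order the pair (i, j) the same way.
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for five recommender systems,
# once computed from human labels and once from LLM-judge labels.
human_scores = {"A": 0.42, "B": 0.38, "C": 0.35, "D": 0.30, "E": 0.21}
llm_scores = {"A": 0.45, "B": 0.36, "C": 0.39, "D": 0.28, "E": 0.19}

systems = sorted(human_scores)
tau = kendall_tau(
    [human_scores[s] for s in systems],
    [llm_scores[s] for s in systems],
)
print(f"Kendall's tau = {tau:.2f}")  # one swapped pair (B, C) out of 10 -> 0.80
```

A value near 1 means that both label sources rank the systems almost identically, which is the property that matters when the labels are used for model selection.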
