Fugu-MT paper translation (summary): LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

Paper summary: LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

arxiv url: http://arxiv.org/abs/2511.14531v1
Date: Tue, 18 Nov 2025 14:34:35 GMT
Status: Translation complete
System update date: 2025-11-19 16:23:53.156208
Title: LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
Title (reference translation): LiveRAG: A diverse Q&A dataset with varying difficulty levels for RAG evaluation
Authors: David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff, Oren Somekh, Ran Tavory
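Abstract summary: We introduce the LiveRAG benchmark, a dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities.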
Reference score (independently calculated attention score): 12.341210252539776
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the benchmark's questions diversity, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
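Abstract (reference translation): Retrieval Augmented Generation (RAG) is becoming increasingly prominent in generative AI solutions, and the need to systematically evaluate its effectiveness is growing. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not available to competitors during the Challenge, such as the ground-truth answers and the associated supporting claims used to evaluate competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived by applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. We hope the LiveRAG benchmark will help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.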
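Note on the difficulty and discriminability scores: the abstract does not say which Item Response Theory model was used. As an illustration only, the sketch below assumes the common two-parameter logistic (2PL) model, in which each question has a difficulty and a discrimination parameter and each system has a latent ability. The function name and all numeric values are hypothetical and are not taken from the paper or the dataset.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """2PL IRT curve: probability that a system with the given latent
    ability answers a question correctly.

    Assumption: a standard 2PL model; the paper's actual model may differ.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    # Hypothetical questions: (label, difficulty, discrimination).
    questions = [
        ("easy, weakly discriminating", -1.0, 0.5),
        ("hard, strongly discriminating", 1.5, 2.0),
    ]
    # Hypothetical system abilities on the same latent scale.
    for ability in (-1.0, 0.0, 2.0):
        for label, difficulty, discrimination in questions:
            p = p_correct(ability, difficulty, discrimination)
            print(f"ability={ability:+.1f}  {label}: P(correct)={p:.3f}")
```

Under this assumed model, a higher difficulty shifts the curve so that more ability is needed to reach a 50% chance of success, and a higher discrimination makes the curve steeper, so the question separates systems just below and just above its difficulty more sharply.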


Related papers