Fugu-MT Paper Translation (Abstract): Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

Paper overview: Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

arxiv url: http://arxiv.org/abs/2510.11956v1
Date: Mon, 13 Oct 2025 21:38:04 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.099942
Title: Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries
Title (reference translation): Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries
Authors: Gabrielle Kaili-May Liu, Bryan Li, Arman Cohan, William Gantt Walden, Eugene Yang
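Abstract summary: Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. Existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions. We present the first pipeline for automatic creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop queries.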
Reference score (independently computed attention score): 53.99620546358492
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability for such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0\% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.
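Abstract (reference translation): In real-world use cases, RAG systems often face complex queries for which the relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Nevertheless, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which can often be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover the limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs). We use the pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness through benchmark experiments on leading retrieval-augmented LLMs. The results show that, compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to an 81.0\% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and to drive the development of more capable RAG systems.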

Related papers