Fugu-MT Paper Translation (Abstract): Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

Paper abstract: Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

arxiv url: http://arxiv.org/abs/2605.11330v1
Date: Mon, 11 May 2026 23:33:58 GMT
Status: Translation complete
System update date: 2026-05-13 21:48:56.477996
Title: Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights
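Title (reference translation): Rethinking LLM Hallucination Detection: Desiderata and a New RAG Benchmark
Authors: Wenbo Chen, Veena Padmanabhan, Tootiya Giyahchi, Elaine Wong, Leman Akoglu
Abstract summary: We build and open-source TRIVIA+, a new RAG-based HDB that underwent a rigorous human annotation process. Notably, the benchmark exhibits all the desirable properties, including containing the samples with the longest context in the literature. We run experiments on RAG-based HDBs (including TRIVIA+) using popular SOTA detectors and report new insights.
Reference score (independently computed attention score): 14.723073002701492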
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hallucination, broadly referring to unfaithful, fabricated, or inconsistent content generated by LLMs, has wide-ranging implications. Therefore, a large body of effort has been devoted to detecting LLM hallucinations, as well as designing benchmark datasets for evaluating these detectors. In this work, we first establish a desiderata of properties for hallucination detection benchmarks (HDBs) to exhibit for effective evaluation. A critical look at existing HDBs through the lens of our desiderata reveals that none of them exhibits all the properties. We identify two largest gaps: (1) RAG-based grounded benchmarks with long context are severely lacking (partly because length impedes human annotation); and (2) existing benchmarks do not make available realistic label noise for stress-testing detectors although real-world use-cases often grapple with label noise due to human or automated/weak annotation. To close these gaps, we build and open-source a new RAG-based HDB called TRIVIA+ that underwent a rigorous human annotation process. Notably, our benchmark exhibits all desirable properties including (1) TRIVIA+ contains samples with the longest context in the literature; and (2) we design and share four sets of noisy labels with different, both sample-dependent and sample-independent, noise schemes. Finally, we perform experiments on RAG-based HDBs, including our TRIVIA+, using popular SOTA detectors that reveal new insights: (i) ample room remains for current detectors to reach the performance ceiling on RAG-based HDBs, (ii) the basic LLM-as-a-Judge baseline performs competitively, and (iii) label noise hinders detection performance. We expect that our findings, along with our proposed benchmark, will motivate and foster needed research on hallucination detection for RAG-based tasks.
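Abstract (reference translation): Hallucination refers to unfaithful, fabricated, or inconsistent content generated by LLMs and has wide-ranging implications. Much effort has therefore gone into detecting LLM hallucinations and into designing benchmark datasets for evaluating these detectors. In this work, we first establish desiderata: properties that hallucination detection benchmarks (HDBs) should exhibit for effective evaluation. A critical look at existing HDBs through the lens of these desiderata shows that none exhibits all of the properties. Two gaps stand out: (1) RAG-based grounded benchmarks with long context are severely lacking, and (2) existing benchmarks offer no realistic label noise for stress-testing detectors, even though real-world use cases often contend with label noise from human or automated/weak annotation. To close these gaps, we build and open-source a new RAG-based HDB called TRIVIA+, which underwent a rigorous human annotation process. Notably, (1) TRIVIA+ contains the samples with the longest context in the literature, and (2) we design and share four sets of noisy labels under different noise schemes, both sample-dependent and sample-independent. Finally, we run experiments on RAG-based HDBs, including TRIVIA+, using popular SOTA detectors and report new insights: (i) ample room remains before current detectors reach the performance ceiling on RAG-based HDBs, (ii) the basic LLM-as-a-Judge baseline performs competitively, and (iii) label noise hinders detection performance. We expect that our findings, along with the proposed benchmark, will motivate and foster needed research on hallucination detection for RAG-based tasks.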
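The abstract distinguishes sample-independent from sample-dependent label noise but does not describe the four noise schemes themselves. The sketch below only illustrates that general distinction for binary hallucination labels and is not the authors' method. The function names, the `rate` and `max_rate` parameters, and the choice to tie the flip probability to context length are all assumptions made for this example.

```python
import random


def flip_uniform(labels, rate, seed=0):
    """Sample-independent noise (illustrative): every binary label is
    flipped with the same probability `rate`, whatever the sample is."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < rate else y for y in labels]


def flip_by_length(labels, context_lengths, max_rate, seed=0):
    """Sample-dependent noise (illustrative): the flip probability scales
    with the sample's context length, reaching `max_rate` for the longest
    sample. Using length as the driver is an assumption of this sketch."""
    rng = random.Random(seed)
    longest = max(context_lengths)
    noisy = []
    for y, n in zip(labels, context_lengths):
        rate = max_rate * n / longest
        noisy.append(1 - y if rng.random() < rate else y)
    return noisy


if __name__ == "__main__":
    clean = [0, 1, 1, 0, 1, 0, 0, 1]          # 1 = hallucinated (hypothetical data)
    lengths = [120, 4000, 800, 60, 9500, 300, 1500, 7000]
    print(flip_uniform(clean, rate=0.2))
    print(flip_by_length(clean, lengths, max_rate=0.4))
```

In the first function an annotator's error rate is constant across samples. In the second, longer contexts are mislabeled more often, which loosely mirrors the abstract's remark that length impedes human annotation.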
