Fugu-MT Paper Translation (Abstract): How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

Paper overview: How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

arxiv url: http://arxiv.org/abs/2603.08274v1
Date: Mon, 09 Mar 2026 11:44:06 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:15.88881
Title: How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
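Title (reference translation): Hallucination Effects of LLMs in Document Q&A Scenarios
Authors: JV Roig
Abstract summary: RIKER is a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. The study finds that even the best-performing models fabricate answers at a non-trivial rate. Results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.
Reference score (independently calculated attention score): 0.0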
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5 - 7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.
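Abstract (reference translation): How much do large language models actually hallucinate when answering questions grounded in provided documents? Although this question is critical for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, by LLM-based judges with documented biases, and by evaluation scales too small for statistical confidence. RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation, is used to address this gap. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), more than 172 billion tokens of evaluation were run, an order of magnitude beyond prior work. The findings are: (1) even the best-performing models fabricate answers at a non-trivial rate (1.19% at best at 32K, with top-tier models at 5-7%), and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced: T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and sharply reduce coherence loss (infinite generation loops), whose rate can be 48x higher at T=0.0 than at T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities, so models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.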

Related papers