Fugu-MT Paper Translation (Abstract): Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

Paper overview: Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

arxiv url: http://arxiv.org/abs/2606.06758v2
Date: Mon, 08 Jun 2026 19:53:29 GMT
Status: Translation complete
Last updated in system: 2026-06-10 13:21:50.624778
Title: Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions
Authors: Haizhou Xia
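Abstract summary: A model may answer from parametric priors, fail to use evidence that is present, or cite relevant text without converting it into the final answer. This paper introduces a four-condition diagnostic protocol for evidence-utilization evaluation.
Reference score (attention measure computed independently by this site): 0.0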
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Final-answer accuracy, retrieval recall, and citation overlap do not reveal how much answer advantage a long-context or retrieval-augmented language model actually recovers from supplied evidence. A model may answer from parametric priors, fail to use evidence that is present, or cite relevant text without converting it into the final answer. This paper introduces a four-condition diagnostic protocol for evidence-utilization evaluation under matched examples, models, prompts, and scoring rules. The protocol compares no-evidence, full-context, retrieved-evidence, and oracle-evidence reference conditions, and uses Oracle-Reference Normalized Context Utilization (ONCU) as a denominator-valid estimate of recovered oracle-reference evidence advantage. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families over Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, comprising 18,000 ONCU-compatible predictions. Results show a task-dependent diagnostic pattern: controlled synthetic settings expose reduced recovery when the same evidence is embedded in long input rather than supplied compactly, while realistic multi-hop reconstructions show that full-context inputs outperform the tested retrieved inputs in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. Sensitivity audits with stronger retrieval settings narrow some gaps but do not overturn the scoped interpretation. The main contribution is therefore not a single utilization ratio, but a matched diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context recovery, retrieval-conditioned recovery, denominator validity, and companion answer/evidence diagnostics.
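The abstract names ONCU but does not give its formula. As a minimal sketch only, the code below assumes a normalized-gain form: the score gained over the no-evidence condition, divided by the gain that oracle evidence provides. The function name `oncu`, the `eps` threshold, and all numeric scores are hypothetical illustrations, not values or definitions from the paper. A group is treated as "denominator-valid" only when oracle evidence actually improves on the no-evidence score, which is one plausible reading of "oracle-improving groups".

```python
from typing import Optional


def oncu(no_evidence: float, condition: float, oracle: float,
         eps: float = 1e-9) -> Optional[float]:
    """Assumed form (not confirmed by the abstract):
    (condition - no_evidence) / (oracle - no_evidence).

    Returns None when the denominator is not positive, i.e. when oracle
    evidence does not improve on the no-evidence score for this group.
    """
    denominator = oracle - no_evidence
    if denominator <= eps:
        return None  # denominator-invalid group: ratio is undefined
    return (condition - no_evidence) / denominator


# Hypothetical accuracy scores for one model on one dataset
# under the four matched conditions named in the abstract.
scores = {
    "no_evidence": 0.25,
    "full_context": 0.625,
    "retrieved": 0.5,
    "oracle": 0.75,
}

full_context_oncu = oncu(scores["no_evidence"], scores["full_context"], scores["oracle"])
retrieved_oncu = oncu(scores["no_evidence"], scores["retrieved"], scores["oracle"])

# A group where oracle evidence gives no gain has no valid ratio.
invalid_oncu = oncu(no_evidence=0.5, condition=0.5, oracle=0.5)

print(full_context_oncu, retrieved_oncu, invalid_oncu)
```

Under this assumed form, a value near 1 would mean the condition recovers most of the advantage that oracle evidence offers, and a value near 0 would mean the supplied evidence adds little beyond what the model answers without it. Returning `None` instead of a number for invalid groups keeps them from being averaged in silently, which is why the abstract's companion denominator-free metrics remain useful alongside the ratio.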
