Fugu-MT Paper Translation (Abstract): A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models

Paper overview: A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models

arxiv url: http://arxiv.org/abs/2606.06758v1
Date: Thu, 04 Jun 2026 22:44:57 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.471349
Title: A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models
Authors: Haizhou Xia
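Abstract summary: A model can answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol: no evidence, full context, retrieved evidence, and oracle-evidence reference. ONCU is used as a protocol-bound estimator of recovered oracle-reference evidence advantage.
Reference score (independently computed attention level): 0.0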
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol--no evidence, full context, retrieved evidence, and oracle-evidence reference--for diagnosing evidence utilization under fixed examples, prompts, score fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families across Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, with 18,000 ONCU-compatible predictions. The main finding is a task-dependent bottleneck split: controlled synthetic settings primarily expose full-context utilization failures, whereas the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. The contribution is a diagnostic protocol for separating no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.
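The abstract does not give ONCU's formula, so the following is only a minimal sketch of one plausible reading of "recovered oracle-reference evidence advantage": the gain of a condition over the no-evidence baseline, divided by the gain of the oracle-evidence reference over that same baseline. The normalized-gain form, the function name `recovered_advantage`, the condition keys, and the `min_gap` threshold are all assumptions made for illustration, not the paper's definition. The sketch also shows what "denominator-valid" could mean: a group is skipped when the oracle reference does not improve on the no-evidence score.

```python
from typing import Dict, Optional

# Hypothetical keys mirroring the four conditions named in the abstract.
NO_EVIDENCE = "no_evidence"
FULL_CONTEXT = "full_context"
RETRIEVED = "retrieved"
ORACLE = "oracle"


def recovered_advantage(
    scores: Dict[str, float],
    condition: str,
    min_gap: float = 1e-9,
) -> Optional[float]:
    """Assumed normalized gain: (condition - none) / (oracle - none).

    Returns None when the oracle reference does not beat the no-evidence
    baseline, i.e. when the denominator is not valid under this reading.
    """
    gap = scores[ORACLE] - scores[NO_EVIDENCE]
    if gap <= min_gap:
        return None  # denominator-invalid group: report other metrics instead
    return (scores[condition] - scores[NO_EVIDENCE]) / gap


# Illustrative numbers only, not results from the paper.
group = {NO_EVIDENCE: 0.25, FULL_CONTEXT: 0.5, RETRIEVED: 0.375, ORACLE: 0.75}
print(recovered_advantage(group, FULL_CONTEXT))  # 0.5
print(recovered_advantage(group, RETRIEVED))     # 0.25

# A group the model already answers without evidence has no valid denominator.
saturated = {NO_EVIDENCE: 0.75, FULL_CONTEXT: 0.75, RETRIEVED: 0.5, ORACLE: 0.75}
print(recovered_advantage(saturated, RETRIEVED))  # None
```

Under this assumed form, a value near 1 would mean the condition recovers most of what the oracle reference makes available, and a value near 0 would mean the evidence was present but largely unused. Returning `None` rather than 0 keeps denominator-invalid groups out of any average, which matches the abstract's statement that the denominator-free metrics are reported separately.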


Related papers