Fugu-MT Paper Translation (Abstract): Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Paper abstract: Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

arxiv url: http://arxiv.org/abs/2605.30087v1
Date: Thu, 28 May 2026 15:33:39 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.427577
Title: Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison
Title (reference translation): Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison
Authors: Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky
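Abstract summary: Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances.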
Reference score (independently computed attention level): 11.187819120306825
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
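Abstract (reference translation): Emerging personal AI agents are moving toward persistent, multi-source memory. Systems must decide how to use conflicting or incomplete evidence. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances. We evaluate baselines with no source access, baselines with single-source access, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage, and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.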
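
The abstract reports selective accuracy together with coverage. The sketch below shows one common way to compute the two numbers. It assumes that coverage is the fraction of questions a system answers and that selective accuracy is the accuracy over the answered questions only. The paper's exact scoring rules may differ, for example in how abstaining on an unanswerable question is credited. The function name `selective_metrics` and the use of `None` to mark an abstention are illustrative choices, not taken from the paper or its released code.

```python
from typing import Optional, Sequence, Tuple


def selective_metrics(
    predictions: Sequence[Optional[str]],
    gold: Sequence[str],
) -> Tuple[float, float]:
    """Return (coverage, selective_accuracy).

    Assumed convention (not confirmed by the paper): a prediction of
    None means the system abstained on that question.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0, 0.0

    # Keep only the questions the system chose to answer.
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        return coverage, 0.0

    correct = sum(1 for p, g in answered if p == g)
    return coverage, correct / len(answered)


# Toy data: four questions, one abstention, two of three answers correct.
coverage, sel_acc = selective_metrics(["a", None, "b", "c"], ["a", "x", "b", "d"])

# Arithmetic check on the reported benchmark size. The product equals the
# reported 34,560 instances, which is consistent with one instance per
# (template, persona, seed) combination; that reading is an assumption.
instances = 18 * 480 * 4
```

Under this convention a system can trade coverage for selective accuracy by abstaining more often, which is why the abstract reports the two figures as a pair rather than on their own.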


Related papers