Fugu-MT Paper Translation (Abstract): MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

Paper summary: MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

arxiv url: http://arxiv.org/abs/2606.04442v1
Date: Wed, 03 Jun 2026 04:44:50 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.544495
Title: MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
Authors: Qiyang Xie, Jialun Wu, Xinjie He, Su Liu, Shuai Xiao, Zhiyuan Lin, Weikai Zhou,
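Abstract summary: MemoryDocDataSet is a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs. Each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents, and multi-session conversations grounded on those documents. The defining feature is the Hybrid source tag: questions that require a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document.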
Reference score (independently computed attention score): 6.180594609315986
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each, sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.
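The abstract reports a median Cohen's kappa across micro-worlds but does not spell out the protocol. As a rough illustration only, the sketch below computes Cohen's kappa between two runs of a judge over the same items and then takes the median over several worlds. The function name `cohen_kappa`, the "ok"/"bad" labels, and the idea that the two runs correspond to two prompt variants are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from statistics import median


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both runs give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical judge verdicts for one micro-world under two prompt variants.
run_a = ["ok", "ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "bad"]
run_b = ["ok", "ok", "bad", "bad", "bad", "ok", "bad", "ok", "ok", "ok"]
kappa_world = cohen_kappa(run_a, run_b)

# Aggregating per-world values with a median, as the abstract describes.
# The other two values here are made-up placeholders.
median_kappa = median([kappa_world, 0.70, 0.55])
```

In this toy case the two runs agree on 8 of 10 items (observed agreement 0.8) while chance agreement is 0.52, so kappa is (0.8 - 0.52) / (1 - 0.52), roughly 0.58. Kappa corrects raw agreement for the agreement expected by chance, which is why it is a more conservative consistency measure than a plain match rate.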