Fugu-MT Paper Translation (Summary): DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Paper summary: DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

arxiv url: http://arxiv.org/abs/2605.22411v1
Date: Thu, 21 May 2026 12:36:46 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.254468
Title: DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
Authors: Jianing Yin, Tan Tang,
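Abstract summary: Large language model (LLM) agents still struggle with long-term memory question answering. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency.
Reference score (independently computed attention score): 3.56156695760535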
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
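The gated reward idea described in the abstract can be illustrated with a small sketch. This is not the paper's DistillPO implementation: the abstract does not give formulas, so the `DistillerOutput` fields, the F1-based selection score, the reward values, and the span names below are all hypothetical choices made only to show how a validity gate can precede quality rewards and how each reward might be routed to the output span it concerns.

```python
from dataclasses import dataclass


@dataclass
class DistillerOutput:
    """Hypothetical structured action: message selection plus rewritten evidence."""
    parsed_ok: bool        # output followed the expected structure
    selected_ids: list     # ids of the candidate messages that were kept
    evidence: str          # rewritten, self-contained evidence text
    answer_correct: bool   # whether a downstream answerer got the question right


def selection_f1(selected, gold):
    """F1 overlap between selected ids and gold supporting ids (assumed metric)."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(selected)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def gated_rewards(out, candidate_ids, gold_ids):
    """Return decomposed rewards; later components count only if the gate passes."""
    rewards = {"validity": 0.0, "task": 0.0, "selection": 0.0}
    # Gate: the action must be well-formed and refer only to retrieved candidates.
    valid = (
        out.parsed_ok
        and len(out.selected_ids) > 0
        and set(out.selected_ids) <= set(candidate_ids)
        and out.evidence.strip() != ""
    )
    if not valid:
        return rewards
    rewards["validity"] = 1.0
    # Task-level correctness is credited as soon as the validity gate passes.
    rewards["task"] = 1.0 if out.answer_correct else 0.0
    # Quality check on the selection part of the action.
    rewards["selection"] = selection_f1(out.selected_ids, gold_ids)
    return rewards


def rewards_by_span(rewards):
    """Route each reward to the output span assumed to be responsible for it."""
    return {
        "selection_span": rewards["validity"] + rewards["selection"],
        "evidence_span": rewards["validity"] + rewards["task"],
    }
```

In this sketch, an output that selects an id outside the retrieved candidates receives zero for every component, while a valid output is scored separately on task correctness and selection quality before the rewards are split across the two spans.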
