Fugu-MT Paper Translation (Abstract): MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Paper overview: MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

arxiv url: http://arxiv.org/abs/2605.18565v2
Date: Tue, 19 May 2026 16:05:55 GMT
Status: Translation complete
System update date: 2026-05-20 15:03:08.581294
Title: MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
Title (reference translation): MINTEval: Memory Evaluation under Multi-Target Interference in Long-Horizon Agent Systems
Authors: Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal
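Abstract summary: This study examines how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings. MINTEval is a benchmark featuring long, highly interconnected contexts with frequently updated information. It contains 15.6k question-answering pairs over contexts that average 138.8k tokens and extend up to 1.8M tokens per instance.
Reference score (independently computed attention score): 69.06764269022925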
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.
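Abstract (reference translation): Real-world agents operate over long, evolving horizons in which information is repeatedly updated and can interfere across memories, so they need accurate recall and aggregated reasoning over multiple pieces of information. Existing benchmarks, however, focus on static, independent recall and do not capture the dynamic interactions between evolving memories. This paper examines how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across a range of domains and question types. It introduces MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts whose frequently updated information induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits) that allow domain generalization to be evaluated, and (3) diverse question types that assess robustness to interference: (i) single-target recall tasks, which require retrieving a specific target from a long context, and (ii) multi-target aggregation tasks, which require reasoning over multiple relevant pieces of information. MINTEval has 15.6k question-answering pairs over long-horizon contexts that average 138.8k tokens and extend up to 1.8M tokens per instance. Seven representative systems are evaluated, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. All systems show consistently low performance (27.9% accuracy on average), especially on questions that require aggregated reasoning over multiple pieces of evidence. The analysis indicates that performance is limited primarily by retrieval and memory construction. In addition, current memory systems struggle to recall and reason over earlier facts that later context revises or interferes with, and accuracy degrades as the number of intervening updates grows.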
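The two question types named in the abstract can be illustrated with a toy state-tracking stream. The sketch below is not taken from the paper or its code: the event data, the key names, and the helper functions `state_at`, `single_target_recall`, `multi_target_count`, and `revisions_after` are all hypothetical, chosen only to show how later updates overwrite earlier facts and why an aggregation question depends on several targets at once.

```python
# Hypothetical toy event stream: (step, key, value), ordered by step.
# A later write to the same key revises the earlier one, which is the
# kind of interference the abstract describes.
events = [
    (0, "alice.city", "Paris"),
    (1, "bob.city", "Rome"),
    (2, "alice.city", "Berlin"),
    (3, "carol.city", "Berlin"),
    (4, "bob.city", "Berlin"),
    (5, "alice.city", "Oslo"),
]


def state_at(events, step):
    """Replay all updates up to and including `step`; return key -> value."""
    state = {}
    for t, key, value in events:
        if t <= step:
            state[key] = value
    return state


def single_target_recall(events, key, step):
    """Single-target question: what is the value of one key at `step`?"""
    return state_at(events, step).get(key)


def multi_target_count(events, value, step):
    """Multi-target question: how many keys hold `value` at `step`?"""
    return sum(1 for v in state_at(events, step).values() if v == value)


def revisions_after(events, key, step):
    """Count later writes to `key` that revise the fact stated at `step`."""
    return sum(1 for t, k, _ in events if k == key and t > step)


print(single_target_recall(events, "alice.city", 5))  # Oslo
print(multi_target_count(events, "Berlin", 4))         # 3
print(revisions_after(events, "alice.city", 0))        # 2
```

In this toy setting, a system that remembers only the first statement about `alice.city` answers the single-target question wrongly, and the same stale fact also shifts the multi-target count, which is one way to picture why aggregation questions are more fragile under updates.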


Related papers