Fugu-MT Paper Translation (Abstract): Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Paper overview: Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

arxiv url: http://arxiv.org/abs/2511.20857v1
Date: Tue, 25 Nov 2025 21:08:07 GMT
Status: Translation complete
System update date: 2025-11-27 18:37:58.863409
Title: Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Title (reference translation): Evo-Memory: Benchmarking Test-Time Learning of LLM Agents with Self-Evolving Memory
Authors: Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng
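Abstract summary: Evo-Memory is a streaming benchmark and framework for evaluating self-evolving memory in large language model (LLM) agents. It unifies over ten representative memory modules and evaluates them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.
Reference score (independently calculated attention measure): 89.65731902036669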
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
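Abstract (reference translation): Statefulness is essential for large language model (LLM) agents to carry out long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations focus mainly on static conversational settings, in which memory is passively retrieved from dialogue to answer queries. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet they often fail to learn from accumulated interactions. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.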
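
The streaming protocol described in the abstract (search memory, act on the task, then evolve memory after each interaction) can be illustrated with a minimal sketch. This is not the paper's code, nor an implementation of ExpRAG or ReMem: the names `Experience`, `Memory`, `run_stream`, and `toy_solve` are hypothetical, retrieval uses simple word overlap as a stand-in for a real retriever, and the solver is a placeholder for an LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored interaction: the task text and what came out of it."""
    task: str
    outcome: str


@dataclass
class Memory:
    """Toy memory store (hypothetical; not one of the paper's memory modules)."""
    entries: list = field(default_factory=list)

    def search(self, query, k=2):
        # Rank stored experiences by word overlap with the query.
        # Assumption: a real system would use a learned retriever instead.
        words = set(query.lower().split())
        scored = []
        for index, entry in enumerate(self.entries):
            overlap = len(words & set(entry.task.lower().split()))
            if overlap > 0:
                scored.append((overlap, index, entry))
        # Highest overlap first; ties keep insertion order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [entry for _, _, entry in scored[:k]]

    def evolve(self, task, outcome):
        # Simplest possible update: append the new experience.
        self.entries.append(Experience(task, outcome))


def toy_solve(task, retrieved):
    # Placeholder for an LLM call that would condition on retrieved experience.
    return f"answer({task}) using {len(retrieved)} prior experiences"


def run_stream(tasks, solve, memory):
    """Process tasks in order, updating memory after each interaction."""
    results = []
    for task in tasks:
        retrieved = memory.search(task)       # search
        outcome = solve(task, retrieved)      # act on the task
        memory.evolve(task, outcome)          # evolve memory
        results.append((task, len(retrieved), outcome))
    return results
```

Calling `run_stream(["sort a list", "sort a dict", "bake bread"], toy_solve, Memory())` processes the tasks in order, so only the second task can reuse an earlier, lexically similar experience. The point of the sketch is the ordering constraint: memory for task t contains only experiences from tasks before t.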


Related papers