Fugu-MT Paper Translation (Abstract): Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

Paper overview: Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

arxiv url: http://arxiv.org/abs/2605.15384v1
Date: Thu, 14 May 2026 20:15:22 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.086931
Title: Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
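Title (reference translation): Is One Score Enough? Reconsidering the Evaluation of Sequentially Evolving LLM Memory
Authors: Songwei Dong, Zihan Chen, Chengshuai Shi, Peng Wang, Jundong Li, Cong Shen
Abstract summary: This paper introduces SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving large language model (LLM) memory. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference.
Reference score (independently computed attention score): 50.857546269660276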
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of memory quality. Through extensive experiments across diverse tasks and memory methods, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from substantial forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability that remain invisible under standard evaluation metrics.
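Abstract (reference translation): Memory plays a central role in letting large language models (LLMs) handle sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory rely heavily on aggregate measures such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. This paper introduces SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, giving a finer-grained view of memory quality. Extensive experiments across diverse tasks and memory methods show that higher final or cumulative accuracy does not necessarily mean better memory quality. Moreover, different memory designs show distinct trade-offs between adaptability and stability that are invisible under standard evaluation metrics.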
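
The gap between a single aggregate score and the diagnostic quantities named in the abstract can be illustrated with a small sketch. The abstract does not state SeqMem-Eval's formulas, so the definitions below are assumptions borrowed from common continual-learning conventions, not the paper's own. The accuracy matrix `acc`, where `acc[i][j]` is the accuracy on task `j` measured with the memory state reached after task `i`, and all function names are hypothetical.

```python
from typing import Sequence

# Assumption: acc[i][j] is the accuracy on task j when evaluated with the
# memory state obtained after processing task i (only entries with i >= j
# are used). These are standard continual-learning style definitions and
# are NOT taken from the SeqMem-Eval paper itself.
Matrix = Sequence[Sequence[float]]


def final_average_accuracy(acc: Matrix) -> float:
    """Mean accuracy over all tasks using the final memory state."""
    last = acc[len(acc) - 1]
    return sum(last) / len(last)


def backward_transfer(acc: Matrix) -> float:
    """Mean change on earlier tasks between 'just learned' and the final state.

    Negative values mean later memory updates hurt earlier tasks.
    """
    t = len(acc)
    if t < 2:
        raise ValueError("need at least two tasks")
    return sum(acc[t - 1][j] - acc[j][j] for j in range(t - 1)) / (t - 1)


def forgetting(acc: Matrix) -> float:
    """Mean drop from the best accuracy ever reached on a task to its final accuracy."""
    t = len(acc)
    if t < 2:
        raise ValueError("need at least two tasks")
    total = 0.0
    for j in range(t - 1):
        best_so_far = max(acc[i][j] for i in range(j, t - 1))
        total += best_so_far - acc[t - 1][j]
    return total / (t - 1)


# Made-up illustrative numbers, not results from the paper.
example = [
    [0.80, 0.00, 0.00],
    [0.85, 0.70, 0.00],
    [0.50, 0.90, 0.75],
]

if __name__ == "__main__":
    print("final average accuracy:", final_average_accuracy(example))
    print("backward transfer:", backward_transfer(example))
    print("forgetting:", forgetting(example))
```

In this made-up example the final average accuracy looks reasonable, while task 0 has dropped well below its earlier peak. That drop is the kind of effect a single final score can hide.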


Related papers