Fugu-MT paper translation (abstract): Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Paper overview: Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

arxiv url: http://arxiv.org/abs/2605.21768v1
Date: Wed, 20 May 2026 22:02:00 GMT
Status: translation complete
Last updated in system: 2026-05-22 20:14:18.495188
Title: Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
Title (reference translation): Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
Authors: Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma
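Abstract summary: Memory-augmented LLM agents enable interactions that extend beyond finite context windows. Training such agents with reinforcement learning in multi-session environments is difficult. We introduce Memory-R2, a training framework for memory-augmented LLM agents.
Reference score (independently computed attention score): 27.2861945963127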
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
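Abstract (reference translation): Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is difficult, because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, and trajectory-level comparisons become fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, which compare rollouts as if they had been sampled from the same effective environment. As a result, trajectory-level rewards give noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, giving fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, in which a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.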
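
The fairness argument in the abstract can be made concrete with a small sketch. It is not the paper's LoGo-GRPO objective, which the abstract does not spell out. It only shows the commonly used group-relative standardization (reward minus group mean, divided by group standard deviation) applied to two kinds of groups: a global group of full trajectories and a local group of rerollouts that all start from the same intermediate memory state. The helper `combine_advantages`, the `local_weight` parameter, and all reward values are hypothetical and chosen purely for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combine_advantages(global_adv, local_adv, local_weight=0.5):
    """Hypothetical convex mix of a trajectory-level and a step-level advantage.

    This weighting is an assumption for illustration, not the paper's formula.
    """
    return (1.0 - local_weight) * global_adv + local_weight * local_adv


# Global group: full multi-session trajectories, one final reward each.
# Their memory states have diverged, so this comparison is the "unfair" one.
global_rewards = [1.0, 0.0, 0.0, 1.0]
global_adv = group_relative_advantages(global_rewards)

# Local group: rerollouts of a single memory operation that all start from
# the SAME intermediate memory state, so they are directly comparable.
local_rewards = [0.2, 0.8, 0.5]
local_adv = group_relative_advantages(local_rewards)

# Credit for one memory operation: trajectory 0 globally, rerollout 1 locally.
mixed = combine_advantages(global_adv[0], local_adv[1])
```

Within each group the advantages sum to roughly zero, so a rollout is rewarded only for doing better than its peers. The local group restricts those peers to rollouts that faced the same memory state, which is the comparison the abstract describes as fairer.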

Related papers