Fugu-MT paper translation (abstract): What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Paper overview: What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

arxiv url: http://arxiv.org/abs/2605.23067v1
Date: Thu, 21 May 2026 21:58:10 GMT
Status: translation complete
Last updated in system: 2026-05-25 17:29:20.116861
Title: What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Authors: Xinjie He, Zhiyuan Lin, Su Liu, Jialun Wu, Qiyang Xie, Weikai Zhou, Shuai Xiao
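Abstract summary: Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. This paper presents a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions. Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance.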
Reference score (independently computed attention measure): 6.180594609315985
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
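The abstract's point about binary exact-match reward at G = 4 can be illustrated with a minimal sketch. It assumes the commonly used group-relative advantage, (reward - group mean) / (group std + eps), and uses token-level F1 as one possible continuous reward; the abstract does not say which continuous reward the authors adopted, and the gold answer and the four sampled answers below are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def exact_match(pred: str, gold: str) -> float:
    """Binary reward: 1.0 only when the normalized strings are identical."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred: str, gold: str) -> float:
    """Continuous reward (an assumption here): token-level F1 against the gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: center and scale rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of G = 4 sampled answers, none of them exactly correct.
gold = "7 May 2023"
group = ["May 2023", "in May", "7 June 2023", "last year"]

em_adv = group_advantages([exact_match(a, gold) for a in group])
f1_adv = group_advantages([token_f1(a, gold) for a in group])

print("exact-match advantages:", em_adv)  # all zero: no answer is preferred
print("token-F1 advantages:   ", f1_adv)  # partial credit separates the answers
```

With a small group, it is likely that no sample matches exactly, so every reward is identical and every advantage is zero; partial credit keeps the rewards distinct and the advantages informative.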
