Fugu-MT Paper Translation (Abstract): AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Paper overview: AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

arxiv url: http://arxiv.org/abs/2606.02461v2
Date: Tue, 02 Jun 2026 03:07:54 GMT
Status: Translation complete
Last updated in system: 2026-06-03 18:57:50.563443
Title: AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
Title (reference translation): AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
Authors: Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su
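Abstract summary: Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams. This paper presents AgentCL, an evaluation framework for continual learning in agents, centered on controlled task streams and metrics for transfer gains.
Reference score (independently computed attention score): 30.801952443449633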
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.
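Abstract (reference translation): Language agents spend substantial inference time solving individual tasks, yet the experience gained in one episode is often underutilized in later episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with only limited analysis of cross-task relationships, which makes it hard to understand what an agent learns and reuses over time. This paper presents AgentCL, an evaluation framework for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL builds compositional streams in which earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams in which such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills while filtering out unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams have limited ability to distinguish memory designs, whereas controlled streams distinguish their plasticity more clearly. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.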
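The abstract mentions metrics for transfer gains and memory-induced degradation but does not define them. As an illustration only, the sketch below assumes one plausible reading: transfer gain is the success rate of a memory-equipped agent on a task stream minus the success rate of the same agent with memory disabled, and a task counts as degraded when it is solved without memory but fails with it. The names `EpisodeResult`, `transfer_gain`, and `degraded_tasks` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    """Outcome of one task in a stream (hypothetical record layout)."""
    task_id: str
    success_with_memory: bool     # agent may reuse stored experience
    success_without_memory: bool  # same task, memory disabled


def transfer_gain(results):
    """Success rate with memory minus success rate without memory.

    Assumed definition for illustration; positive values suggest useful
    reuse, negative values suggest interference from stored experience.
    """
    if not results:
        raise ValueError("results must not be empty")
    with_mem = mean(float(r.success_with_memory) for r in results)
    without_mem = mean(float(r.success_without_memory) for r in results)
    return with_mem - without_mem


def degraded_tasks(results):
    """IDs of tasks solved without memory but failed with memory."""
    return [
        r.task_id
        for r in results
        if r.success_without_memory and not r.success_with_memory
    ]


if __name__ == "__main__":
    # Made-up outcomes for a four-task stream, not data from the paper.
    stream = [
        EpisodeResult("t1", True, True),
        EpisodeResult("t2", True, False),
        EpisodeResult("t3", True, False),
        EpisodeResult("t4", False, True),
    ]
    print("transfer gain:", transfer_gain(stream))
    print("degraded tasks:", degraded_tasks(stream))
```

Under this reading, running the same comparison on a compositional stream and on a naive stream would give two gain values per memory design, which is the kind of contrast the abstract describes.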


Related papers