Fugu-MT Paper Translation (Abstract): Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Paper summary: Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

arxiv url: http://arxiv.org/abs/2606.05661v1
Date: Thu, 04 Jun 2026 03:43:28 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.530761
Title: Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
Title (reference translation): Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
Authors: Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, Joseph E. Gonzalez
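Abstract summary: We introduce Continual Learning Bench (CL-Bench), the first benchmark to measure whether AI systems genuinely improve with experience. CL-Bench spans six distinct domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting). We evaluate frontier models across several agent architectures, from in-context learning (ICL) to dedicated memory systems.
Reference score (independently calculated attention score): 44.90458129179607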
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.
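Abstract (reference translation): Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. CL-Bench (Continual Learning Bench) is the first difficult benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies). We evaluate frontier models across several agent architectures, from in-context learning (ICL) to dedicated memory systems. Agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this; in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and to isolate online learning from underlying model capability, showing a need for better continual learning systems.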

Related papers