Fugu-MT Paper Translation (Abstract): CL-bench Life: Can Language Models Learn from Real-Life Context?

Paper overview: CL-bench Life: Can Language Models Learn from Real-Life Context?

arxiv url: http://arxiv.org/abs/2604.27043v1
Date: Wed, 29 Apr 2026 17:44:32 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:53.737827
Title: CL-bench Life: Can Language Models Learn from Real-Life Context?
Title (reference translation): CL-bench Life: Can Language Models Learn from Real-Life Context?
Authors: Shihan Dou, Yujiong Shen, Chenhao Huang, Junjie Ye, Jiayi Chen, Junzhe Wang, Qianyu He, Shichun Liu, Changze Lv, Jiahang Lin, Jiazheng Zhang, Ming Zhang, Shaofan Liu, Tao Ji, Zhangyue Yin, Cheng Zhang, Huaibing Xie, Jianglu Hu, Jingcheng Deng, Lincheng Li, Minda Hu, Shaolei Wang, Syrus Zhao, Weichao Wang, Yan Lei, Yang Liu, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Ziliang Zhao, Pluto Zhou, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao
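Abstract summary: CL-bench Life is a benchmark comprising 405 context-task pairs and 5,348 verification rubrics. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging. CL-bench Life provides an important testbed for advancing real-life context learning.
Reference score (independently computed attention score): 123.88885809530706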
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life context learning abilities that go far beyond those evaluated in existing benchmarks. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves only 19.3% task solving rate, while the average performance across models is only 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life. CL-bench Life provides a crucial testbed for advancing real-life context learning, and progress on it can enable more intelligent and reliable AI assistants in everyday life.
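Abstract (reference translation): Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they handle also changes. Real-life contexts are often messy and fragmented, and are deeply tied to personal and social experience, as in multi-party conversations, personal archives, and behavioral traces. However, it is unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. CL-bench Life is a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics. Solving tasks in CL-bench Life requires models to reason over complex real-life contexts. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves a task-solving rate of only 19.3%, while the average performance across models is only 13.8%. Models struggle to reason over contexts such as messy group chat histories and fragmented records of everyday behavior. CL-bench Life provides an important testbed for advancing real-life context learning.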


Related papers