CL-bench: A Benchmark for Context Learning
- URL: http://arxiv.org/abs/2602.03587v1
- Date: Tue, 03 Feb 2026 14:37:47 GMT
- Title: CL-bench: A Benchmark for Context Learning
- Authors: Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao
- Abstract summary: We term this capability context learning, a crucial ability that humans naturally possess but has been largely overlooked. We introduce CL-bench, a real-world benchmark consisting of 500 contexts, 1,899 tasks, and 31,607 verification rubrics. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
- Score: 152.2879060355882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
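As a rough illustration of how rubric-based verification of context-dependent tasks might be organized, here is a minimal Python sketch. The abstract does not specify CL-bench's data format or grading procedure, so every field name and the string-matching check below are hypothetical stand-ins.

```python
# Hypothetical sketch of CL-bench-style rubric verification.
# All field names and the substring check are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rubric:
    criterion: str  # one expert-written, verifiable requirement

    def passes(self, response: str) -> bool:
        # Stand-in check; real rubrics would be judged by humans
        # or an LM grader, not by substring matching.
        return self.criterion.lower() in response.lower()

@dataclass
class Task:
    context: str    # all knowledge needed to solve the task lives here
    question: str
    rubrics: list[Rubric] = field(default_factory=list)

    def solved(self, response: str) -> bool:
        # A task counts as solved only if every rubric is satisfied.
        return all(r.passes(response) for r in self.rubrics)

def solve_rate(tasks: list[Task], answer: Callable[[str, str], str]) -> float:
    """Fraction of tasks solved -- the benchmark's headline metric."""
    return sum(t.solved(answer(t.context, t.question)) for t in tasks) / len(tasks)
```

Under the benchmark's real rubrics, the reported numbers (17.2% average, 23.7% for GPT-5.1) correspond to this kind of solve rate.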
Related papers
- Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents [22.620674535292068]
Large language model (LLM) agents typically receive two kinds of context: environment-level manuals that define interaction interfaces and global rules, and task-level guidance or demonstrations tied to specific goals.
We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks.
We formalize this problem as Instance-Level Context Learning (ILCL) and introduce a task-agnostic method to solve it.
arXiv Detail & Related papers (2025-09-29T05:38:51Z)
- Learning Task Representations from In-Context Learning [67.66042137487287]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL).
We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.
The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks.
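A toy sketch of the general idea of representing a task as a function of attention-head activations follows; the shapes, the softmax weighting, and the extraction point (last demonstration token) are illustrative assumptions, not the paper's exact formulation.

```python
# Toy illustration of deriving a "task vector" from attention heads.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, head_dim = 12, 8, 64

# Per-head activations at the final token of the ICL demonstrations.
head_acts = rng.normal(size=(n_layers, n_heads, head_dim))

# One learnable scalar per (layer, head); trained in practice, random here.
logits = rng.normal(size=(n_layers, n_heads))
weights = np.exp(logits) / np.exp(logits).sum()  # softmax over all heads

# Task vector: weighted sum of head activations across layers and heads.
task_vector = (weights[..., None] * head_acts).sum(axis=(0, 1))
print(task_vector.shape)  # (64,) -- reusable on zero-shot queries
```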
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
- VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks [100.3234156027118]
We present VLABench, an open-source benchmark for evaluating universal language-conditioned manipulation (LCM) task learning.
VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects.
The benchmark assesses multiple competencies, including understanding of mesh and texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning.
arXiv Detail & Related papers (2024-12-24T06:03:42Z)
- On Many-Shot In-Context Learning for Long-Context Evaluation [10.500629810624769]
This paper delves into long-context language model evaluation through many-shot ICL.
We develop metrics to categorize ICL tasks into two groups: similar-sample learning (SSL) and all-sample learning (ASL).
We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
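One way to picture the SSL/ASL distinction is a nearest-demonstration test: an input with a close neighbor among the in-context samples can lean on similar-sample learning, while one that must aggregate evidence across the whole context falls under all-sample learning. The cosine criterion and threshold below are assumptions for illustration, not the paper's metrics.

```python
# Illustrative SSL-vs-ASL split: an input is "similar-sample" if some
# in-context demonstration is close to it, "all-sample" otherwise.
import numpy as np

def is_similar_sample(test_emb: np.ndarray, demo_embs: np.ndarray,
                      tau: float = 0.8) -> bool:
    demo_n = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    test_n = test_emb / np.linalg.norm(test_emb)
    sims = demo_n @ test_n           # cosine similarity to every demo
    return float(sims.max()) >= tau  # SSL if one demo can carry the answer
```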
arXiv Detail & Related papers (2024-11-11T17:00:59Z)
- Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack [33.178008350124315]
We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn a sequence of language tasks through in-context learning (ICL).
We introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL.
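A minimal mock-up of what a Lifelong ICL prompt could look like, assuming a simple task-block format (the actual Task Haystack templates may differ):

```python
# Mock-up of a Lifelong ICL prompt: demonstrations for a stream of tasks
# are concatenated, then the model is tested on a task seen earlier in
# the stream. The block format is an assumption.

def lifelong_icl_prompt(task_demos: dict[str, list[str]],
                        test_task: str, test_input: str) -> str:
    blocks = [f"# Task: {name}\n" + "\n".join(demos)
              for name, demos in task_demos.items()]
    # The test task is one the model already saw demos for, far upstream.
    blocks.append(f"# Task: {test_task}\n{test_input} ->")
    return "\n\n".join(blocks)
```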
arXiv Detail & Related papers (2024-07-23T17:57:41Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context of up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
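Schematically, in-context retrieval of this kind amounts to placing an entire corpus in the prompt and letting the model retrieve and reason without an external retriever; the template below is an assumption, not LOFT's actual prompt format.

```python
# Schematic of in-context retrieval: the whole corpus goes into the
# prompt; no external retrieval step is performed.

def long_context_prompt(corpus: list[str], query: str) -> str:
    docs = "\n".join(f"[doc {i}] {d}" for i, d in enumerate(corpus))
    return ("Answer using only the documents below, citing doc ids.\n\n"
            f"{docs}\n\nQuestion: {query}\nAnswer:")

# With million-token windows, `corpus` can hold thousands of documents,
# subsuming the retrieval step a RAG pipeline would perform externally.
print(long_context_prompt(["Paris is the capital of France."],
                          "What is the capital of France?"))
```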
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models [68.18370230899102]
We investigate how to elicit compositional generalization capabilities in large language models (LLMs).
We find that demonstrating both foundational skills and compositional examples grounded in these skills within the same prompt context is crucial.
We show that fine-tuning LLMs with SKiC-style data can elicit zero-shot weak-to-strong generalization.
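A small sketch of a SKiC-style prompt, with foundational skills demonstrated first and a composition grounded in those skills shown in the same context; the skill and demonstration text are invented for illustration.

```python
# Sketch of a Skills-in-Context (SKiC) prompt: foundational skills,
# then an example composing them, all in one context.

SKILLS = [
    "Skill last_letter: the last letter of 'apple' is 'e'.",
    "Skill concat: joining 'e' and 'a' gives 'ea'.",
]
COMPOSED_DEMO = (
    "Task: concatenate the last letters of 'apple banana'.\n"
    "Apply last_letter to each word: 'e', 'a'. Apply concat: 'ea'."
)

def skic_prompt(question: str) -> str:
    # Both the basic skills and a composition grounded in them appear
    # in-context, which the paper finds is crucial.
    return "\n\n".join([*SKILLS, COMPOSED_DEMO, f"Task: {question}"])

print(skic_prompt("concatenate the last letters of 'machine learning'."))
```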
arXiv Detail & Related papers (2023-08-01T05:54:12Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
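A rough sketch of PICL-style data construction, under the simplifying assumption that paragraphs can be grouped by an intrinsic-task key (the paper uses a retrieval step instead): same-task paragraphs are concatenated so that earlier ones act as demonstrations during ordinary language-model pre-training.

```python
# Rough sketch of PICL-style data construction: paragraphs sharing an
# "intrinsic task" are concatenated so earlier ones act as in-context
# demonstrations for the last one during plain LM pre-training.
from collections import defaultdict
from typing import Callable, Hashable

def build_picl_instances(paragraphs: list[str],
                         task_of: Callable[[str], Hashable],
                         k: int = 4) -> list[str]:
    by_task: dict[Hashable, list[str]] = defaultdict(list)
    for p in paragraphs:
        by_task[task_of(p)].append(p)
    instances = []
    for group in by_task.values():
        # Each instance: k same-task paragraphs as demos + 1 target.
        for i in range(k, len(group)):
            instances.append("\n\n".join(group[i - k:i + 1]))
    return instances  # trained with the ordinary language-modeling loss
```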
arXiv Detail & Related papers (2023-05-16T03:38:06Z)