CL4SE: A Context Learning Benchmark For Software Engineering Tasks
- URL: http://arxiv.org/abs/2602.23047v1
- Date: Thu, 26 Feb 2026 14:28:57 GMT
- Title: CL4SE: A Context Learning Benchmark For Software Engineering Tasks
- Authors: Haichuan Hu, Ye Shang, Guoqing Xie, Congqing He, Quanjun Zhang,
- Abstract summary: Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks. Existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the effects of different contexts. We propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types. We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics.
- Score: 7.899464362501583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a representative task (code generation, code summarization, code review, and patch correctness assessment). We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics. Extensive experiments demonstrate that context learning yields an average performance improvement of 24.7% across all tasks. Specifically, procedural context boosts code review performance by up to 33% (Qwen3-Max), mixed positive-negative context improves patch assessment by 30% (DeepSeek-V3), project-specific context increases code summarization BLEU by 14.78% (GPT-Oss-120B), and interpretable examples enhance code generation PASS@1 by 5.72% (DeepSeek-V3). CL4SE establishes the first standardized evaluation framework for SE context learning, provides actionable empirical insights into task-specific context design, and releases a large-scale dataset to facilitate reproducible research in this domain.
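The abstract names the four context types and their paired tasks but does not show how a benchmark sample is assembled or scored. The sketch below is a rough illustration under assumed names (CONTEXT_FOR_TASK, build_prompt, pass_at_1, and the sample field names are hypothetical, not the released CL4SE schema): it prepends the task-appropriate context to a base prompt and computes PASS@1 as the fraction of problems whose single generated candidate passes its tests.

```python
# Illustrative sketch only: field names, CONTEXT_FOR_TASK, build_prompt and
# pass_at_1 are assumptions for exposition, not the released CL4SE schema.
from typing import Dict, List

# Hypothetical mapping of the four SE-oriented context types to their
# representative tasks, as described in the abstract.
CONTEXT_FOR_TASK = {
    "code_generation": "interpretable_examples",
    "code_summarization": "project_specific_context",
    "code_review": "procedural_decision_context",
    "patch_assessment": "positive_negative_context",
}

def build_prompt(sample: Dict[str, str], task: str, with_context: bool) -> str:
    """Prepend the task-appropriate context block to the base instruction."""
    parts = []
    if with_context:
        parts.append(sample[CONTEXT_FOR_TASK[task]])  # context block for this task
    parts.append(sample["instruction"])               # task description
    parts.append(sample["input"])                     # e.g. code, diff, or patch
    return "\n\n".join(parts)

def pass_at_1(first_candidate_passed: List[bool]) -> float:
    """PASS@1 with one generation per problem: the share of problems whose
    first candidate passes all unit tests."""
    if not first_candidate_passed:
        return 0.0
    return sum(first_candidate_passed) / len(first_candidate_passed)

# Example: a code-generation sample scored with its interpretable examples attached.
sample = {
    "interpretable_examples": "# worked example 1 ...\n# worked example 2 ...",
    "instruction": "Implement the function described below.",
    "input": "def parse_config(path):\n    ...",
}
prompt = build_prompt(sample, "code_generation", with_context=True)
```

Comparing the same model's scores with `with_context=True` versus `False` mirrors the with/without-context comparison the abstract reports (e.g., the 5.72% PASS@1 gain for code generation).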
Related papers
- Private PoEtry: Private In-Context Learning via Product of Experts [58.496468062236225]
In-context learning (ICL) enables Large Language Models to adapt to new tasks with only a small set of examples at inference time. Existing differential privacy approaches to ICL are either computationally expensive or rely on oversampling, synthetic data generation, or unnecessary thresholding. We reformulate private ICL through the lens of a Product-of-Experts model. This gives a theoretically grounded framework, and the algorithm can be trivially parallelized (a generic product-of-experts aggregation sketch appears after this list). We find that our method improves accuracy by more than 30 percentage points on average compared to prior DP-ICL methods, while maintaining strong privacy guarantees.
arXiv Detail & Related papers (2026-02-04T19:56:24Z) - Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study [7.0773305889955616]
Large Language Models (LLMs) have shown impressive performance in code generation. LLMs must understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may yield incomplete results.
arXiv Detail & Related papers (2026-01-07T10:23:33Z) - CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks [53.88194225946438]
Chain-of-Thought for Detection (CoT4Det) is a simple but efficient strategy that reformulates perception tasks into three interpretable steps. We show that CoT4Det significantly improves perception performance without compromising general vision language capabilities.
arXiv Detail & Related papers (2025-12-07T05:26:30Z) - Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation [3.9189409002585567]
Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks. We introduce a benchmark derived from real-world open-source repositories to evaluate generalization under practical conditions. We examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs.
arXiv Detail & Related papers (2025-10-30T04:30:23Z) - Clarifying Before Reasoning: A Coq Prover with Structural Context [13.273599284897411]
We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context leads to a 1.85× improvement in clarity score. We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages.
arXiv Detail & Related papers (2025-07-03T11:35:34Z) - A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback [30.446511584123492]
Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions. We synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants.
arXiv Detail & Related papers (2025-07-01T11:51:40Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z) - Larger-Context Tagging: When and Why Does It Work? [55.407651696813396]
We focus on investigating when and why the larger-context training, as a general strategy, can work.
We set up a testbed based on four tagging tasks and thirteen datasets.
arXiv Detail & Related papers (2021-04-09T15:35:30Z)
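For the Private PoEtry entry above, the following is a generic illustration of product-of-experts aggregation, not the paper's algorithm: each "expert" is the model prompted with one subset of in-context examples, and their next-token distributions are multiplied and renormalized in log space. Function and variable names are assumptions for exposition.

```python
import numpy as np

def product_of_experts(expert_probs: np.ndarray) -> np.ndarray:
    """Combine per-expert next-token distributions into one prediction.

    expert_probs: array of shape (num_experts, vocab_size), each row a
    probability distribution produced by prompting the model with one
    subset of in-context examples.
    """
    # Work in log space for numerical stability.
    log_probs = np.log(np.clip(expert_probs, 1e-12, 1.0))
    combined = log_probs.sum(axis=0)      # un-normalized product of experts
    combined -= combined.max()            # guard against underflow in exp
    probs = np.exp(combined)
    return probs / probs.sum()            # renormalize over the vocabulary
```

In the differentially private setting described in that abstract, each expert would see a disjoint subset of the private examples and a noise or aggregation mechanism would be applied before the final prediction; those mechanisms are specific to the paper and are not reproduced here.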
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.