CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
- URL: http://arxiv.org/abs/2507.05281v1
- Date: Fri, 04 Jul 2025 09:42:04 GMT
- Title: CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
- Authors: Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu,
- Abstract summary: Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities.<n> evaluating their performance on engineering-level code remains challenging.<n>Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing.<n>We present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases.
- Score: 36.535790823814516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software or project engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases, and introduce CoreCodeBench, a configurable multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) specifically targeting core code segments. These atomic questions are further combined into three types of composite questions, with difficulty levels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides a comprehensive and extensive repository-level benchmark to investigate the applicability of LLMs in real-world engineering projects. Experiments with 16 LLMs across diverse scenarios reveal varying capabilities and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.
Related papers
- SecCodeBench-V2 Technical Report [43.10947096543533]
We introduce SecCodeBench-V2, a benchmark for evaluating Large Language Model (LLM) copilots' capabilities of generating secure code.<n>SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial productions.<n>For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification.
arXiv Detail & Related papers (2026-02-17T10:47:06Z) - ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development [49.63491095660809]
ProjDevBench is an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories.<n>We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios.<n>Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality but struggle with complex system design, time optimization, and resource management.
arXiv Detail & Related papers (2026-02-02T05:17:23Z) - EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis [101.67583081810136]
Large language models (LLMs) are expected to be trained to act as agents in various real-world environments.<n>This process relies on rich and varied tool-interaction sandboxes.<n>We propose EnvScaler, an automated framework for scalable tool-interaction environments.
arXiv Detail & Related papers (2026-01-09T14:32:06Z) - MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation [0.7342677574855649]
We introduce textbfMRG-Bench, a novel dataset that provides a more accurate evaluation of large language models.<n>We conduct experiments including large language models, long-context models, and RAG-related methods.<n>Results show that the majority of methods suffer from "textbfdifficulty in understanding user requirements," failing to comprehend their assigned tasks accurately.
arXiv Detail & Related papers (2025-08-05T01:53:45Z) - CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [18.886738819470086]
We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance.<n>Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues.<n>Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories.
arXiv Detail & Related papers (2025-07-14T17:19:00Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs)<n>Unlike traditional static benchmarks, SwingArena models the collaborative process of software by pairing LLMs as iterations, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [26.14778133391999]
FEA-Bench is a benchmark designed to assess the ability of large language models to perform incremental development within code repositories.<n>We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development.
arXiv Detail & Related papers (2025-03-09T16:11:57Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.<n>The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code [29.178248778212588]
ComplexCodeEval is a benchmark designed to assess large language models (LLMs) in various development tasks.
It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.
arXiv Detail & Related papers (2024-09-16T13:43:04Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.<n>We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.<n>We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation.<n>We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z) - Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval.<n>DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task.<n> Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge.
It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.