Related papers: FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

URL: http://arxiv.org/abs/2503.06680v1
Date: Sun, 09 Mar 2025 16:11:57 GMT
Title: FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Authors: Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li,
Abstract summary: FEA-Bench is a benchmark designed to assess the ability of large language models to perform incremental development within code repositories.<n>We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development.
Score: 26.14778133391999
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. The feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts in the code repository, providing a more comprehensive evaluation method of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse in the FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.

Related papers

AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion [55.21541958868449]
We propose AlignCoder, a repository-level code completion framework.<n>Our framework generates an enhanced query that bridges the semantic gap between the initial query and the target code.<n>We employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval.
arXiv Detail & Related papers (2026-01-27T15:23:14Z)
Environment-Aware Code Generation: How far are We? [52.69113158357018]
It is unclear whether large language models (LLMs) can reliably generate executable code tailored to a user's specific environment.<n>We present the first systematic study of Environment-Aware Code Generation (EACG), where generated code must be functionally correct and directly executable under arbitrary software configurations.<n>Our results show that current LLMs struggle with environment-specific code generation, while our adaptations improve environment compatibility and executability.
arXiv Detail & Related papers (2026-01-18T04:58:15Z)
On LLM-Assisted Generation of Smart Contracts from Business Processes [0.08192907805418582]
Large language models (LLMs) have changed the reality of how software is produced.<n>We present an exploratory study to investigate the use of LLMs for generating smart contract code from business process descriptions.<n>Our results show that LLM performance falls short of the perfect reliability required for smart contract development.
arXiv Detail & Related papers (2025-07-30T20:39:45Z)
SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? [32.67971774793393]
SWE-Perf is the first benchmark designed to evaluate Large Language Models (LLMs) on code performance optimization tasks within authentic repository contexts.<n>SWE-Perf comprises 140 carefully curated instances, each derived from performance-improving pull requests from popular GitHub repositories.
arXiv Detail & Related papers (2025-07-16T17:05:17Z)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian.<n>This benchmark includes 11 evaluation tasks that span 8 programming languages.<n>We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation [24.090719826360342]
We introduce CodeIF, the first benchmark designed to assess the abilities of Large Language Models (LLMs) to adhere to task-oriented instructions within code generation scenarios. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks.
arXiv Detail & Related papers (2025-02-26T14:19:49Z)
Code Summarization Beyond Function Level [0.213063058314067]
This study investigated the effectiveness of code summarization models beyond the function level. The fine-tuned state-of-the-art CodeT5+ base model excelled in code summarization. Repository-level summarization exhibited promising potential but requires significant computational resources.
arXiv Detail & Related papers (2025-02-23T20:31:21Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation. We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation.<n>We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z)
VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development. We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM) We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories. We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z)
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval.<n>DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task.<n> Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks. FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
A Review of Repository Level Prompting for LLMs [0.0]
Large Language Models (LLMs) have led to notable successes, such as achieving a 94.6% solve rate on the HumanEval benchmark. There is an increasing commercial push for repository-level inline code completion tools, such as GitHub Copilot and Tab Nine. This paper delves into the transition from individual coding problems to repository-scale solutions.
arXiv Detail & Related papers (2023-12-15T00:34:52Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs) It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process. It incorporates a similarity-based retriever and a pre-trained code language model. It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.