RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- URL: http://arxiv.org/abs/2306.03091v2
- Date: Wed, 4 Oct 2023 01:13:49 GMT
- Title: RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- Authors: Tianyang Liu, Canwen Xu, Julian McAuley
- Abstract summary: RepoBench is a benchmark for evaluating code auto-completion systems.
It consists of three evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline)
- Score: 43.797002322559834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have greatly advanced code auto-completion
systems, with a potential for substantial productivity enhancements for
developers. However, current benchmarks mainly focus on single-file tasks,
leaving an assessment gap for more complex, real-world, multi-file programming
scenarios. To fill this gap, we introduce RepoBench, a new benchmark
specifically designed for evaluating repository-level code auto-completion
systems. RepoBench supports both Python and Java and consists of three
interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code
Completion), and RepoBench-P (Pipeline). Each task respectively measures the
system's ability to retrieve the most relevant code snippets from other files
as cross-file context, predict the next line of code with cross-file and
in-file context, and handle complex tasks that require a combination of both
retrieval and next-line prediction. RepoBench aims to facilitate a more
complete comparison of performance and to encourage continuous improvement in
auto-completion systems. RepoBench is publicly available at
https://github.com/Leolty/repobench.
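As a rough illustration of the retrieve-then-predict pipeline that RepoBench-P evaluates, the sketch below ranks candidate cross-file snippets by lexical similarity to the in-file context, builds a prompt, and scores a predicted next line with exact match and edit similarity. The example fields (in_file_context, candidate_snippets, next_line) and the complete_next_line wrapper are hypothetical stand-ins, not the actual repobench API.

```python
# Hypothetical sketch of a RepoBench-P-style pipeline; field and function
# names are illustrative, not the repobench package's interface.
import difflib


def lexical_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two code snippets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def retrieve(in_file_context: str, candidates: list[str], k: int = 3) -> list[str]:
    """RepoBench-R-style step: rank snippets from other files as cross-file context."""
    ranked = sorted(candidates,
                    key=lambda c: lexical_similarity(in_file_context, c),
                    reverse=True)
    return ranked[:k]


def build_prompt(cross_file: list[str], in_file_context: str) -> str:
    """Concatenate retrieved cross-file snippets with the in-file context."""
    header = "\n".join(f"# retrieved from another file:\n{s}" for s in cross_file)
    return f"{header}\n{in_file_context}"


def evaluate(example: dict, complete_next_line) -> dict:
    """RepoBench-C/P-style scoring on the predicted next line.

    `complete_next_line` is any code LM wrapper that returns one predicted line.
    """
    cross_file = retrieve(example["in_file_context"], example["candidate_snippets"])
    prediction = complete_next_line(build_prompt(cross_file, example["in_file_context"]))
    gold = example["next_line"]
    return {
        "exact_match": float(prediction.strip() == gold.strip()),
        "edit_similarity": difflib.SequenceMatcher(None, prediction, gold).ratio(),
    }
```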
Related papers
- PyBench: Evaluating LLM Agent on various real-world coding tasks [13.347173063163138]
PyBench is a benchmark covering five main categories of real-world tasks and spanning more than 10 types of files.
Our evaluations indicate that current open-source LLMs are struggling with these tasks.
Our fine-tuned 8B model, PyLlama3, achieves exciting performance on PyBench.
arXiv Detail & Related papers (2024-07-23T15:23:14Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - PruningBench: A Comprehensive Benchmark of Structural Pruning [50.23493036025595]
We present the first comprehensive benchmark, termed PruningBench, for structural pruning.
PruningBench employs a unified and consistent framework for evaluating the effectiveness of diverse structural pruning techniques.
It provides easily implementable interfaces to facilitate the development of future pruning methods and enables subsequent researchers to incorporate their work into its leaderboards.
arXiv Detail & Related papers (2024-06-18T06:37:26Z) - Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z) - CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion [86.01508183157613]
CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages.
We show that CrossCodeEval is extremely challenging when the relevant cross-file context is absent.
We also show that CrossCodeEval can be used to measure the capability of code retrievers.
arXiv Detail & Related papers (2023-10-17T13:18:01Z) - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
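As a schematic reading of RepoCoder's iterative retrieval-and-generation idea, the loop below feeds each round's generated code back into the retrieval query so that later rounds can surface more relevant cross-file snippets. It reuses the hypothetical retrieve, build_prompt, and complete_next_line helpers from the earlier sketch and is not the framework's released implementation.

```python
# Illustrative iterative retrieval-and-generation loop in the spirit of
# RepoCoder; relies on the hypothetical helpers defined in the sketch above.
def iterative_complete(example: dict, complete_next_line, rounds: int = 2) -> str:
    query = example["in_file_context"]
    prediction = ""
    for _ in range(rounds):
        # Retrieve cross-file snippets with the current query.
        cross_file = retrieve(query, example["candidate_snippets"])
        prediction = complete_next_line(build_prompt(cross_file, example["in_file_context"]))
        # Augment the retrieval query with the latest prediction for the next round.
        query = example["in_file_context"] + "\n" + prediction
    return prediction
```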