RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- URL: http://arxiv.org/abs/2306.03091v2
- Date: Wed, 4 Oct 2023 01:13:49 GMT
- Title: RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- Authors: Tianyang Liu, Canwen Xu, Julian McAuley
- Abstract summary: RepoBench is a benchmark for evaluating code auto-completion systems.
It consists of three evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline)
- Score: 43.797002322559834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have greatly advanced code auto-completion
systems, with a potential for substantial productivity enhancements for
developers. However, current benchmarks mainly focus on single-file tasks,
leaving an assessment gap for more complex, real-world, multi-file programming
scenarios. To fill this gap, we introduce RepoBench, a new benchmark
specifically designed for evaluating repository-level code auto-completion
systems. RepoBench supports both Python and Java and consists of three
interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code
Completion), and RepoBench-P (Pipeline). Each task respectively measures the
system's ability to retrieve the most relevant code snippets from other files
as cross-file context, predict the next line of code with cross-file and
in-file context, and handle complex tasks that require a combination of both
retrieval and next-line prediction. RepoBench aims to facilitate a more
complete comparison of performance and to encourage continuous improvement in
auto-completion systems. RepoBench is publicly available at
https://github.com/Leolty/repobench.
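As a rough illustration of the retrieve-then-predict pipeline that RepoBench-P evaluates, the sketch below ranks candidate cross-file snippets by lexical similarity to the in-file context, builds a prompt, and scores a predicted next line with exact match and edit similarity. The example fields (in_file_context, candidate_snippets, next_line) and the complete_next_line wrapper are hypothetical stand-ins, not the actual repobench API.

```python
# Hypothetical sketch of a RepoBench-P-style pipeline; field and function
# names are illustrative, not the repobench package's interface.
import difflib


def lexical_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two code snippets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def retrieve(in_file_context: str, candidates: list[str], k: int = 3) -> list[str]:
    """RepoBench-R-style step: rank snippets from other files as cross-file context."""
    ranked = sorted(candidates,
                    key=lambda c: lexical_similarity(in_file_context, c),
                    reverse=True)
    return ranked[:k]


def build_prompt(cross_file: list[str], in_file_context: str) -> str:
    """Concatenate retrieved cross-file snippets with the in-file context."""
    header = "\n".join(f"# retrieved from another file:\n{s}" for s in cross_file)
    return f"{header}\n{in_file_context}"


def evaluate(example: dict, complete_next_line) -> dict:
    """RepoBench-C/P-style scoring on the predicted next line.

    `complete_next_line` is any code LM wrapper that returns one predicted line.
    """
    cross_file = retrieve(example["in_file_context"], example["candidate_snippets"])
    prediction = complete_next_line(build_prompt(cross_file, example["in_file_context"]))
    gold = example["next_line"]
    return {
        "exact_match": float(prediction.strip() == gold.strip()),
        "edit_similarity": difflib.SequenceMatcher(None, prediction, gold).ratio(),
    }
```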
Related papers
- PyBench: Evaluating LLM Agent on various real-world coding tasks [13.347173063163138]
PyBench is a benchmark covering five main categories of real-world tasks and spanning more than 10 types of files.
Our evaluations indicate that current open-source LLMs are struggling with these tasks.
Our fine-tuned 8B model, PyLlama3, achieves exciting performance on PyBench.
arXiv Detail & Related papers (2024-07-23T15:23:14Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - PruningBench: A Comprehensive Benchmark of Structural Pruning [50.23493036025595]
We present the first comprehensive benchmark, termed PruningBench, for structural pruning.
PruningBench employs a unified and consistent framework for evaluating the effectiveness of diverse structural pruning techniques.
It provides easily implementable interfaces to facilitate the development of future pruning methods and enables subsequent researchers to incorporate their work into its leaderboards.
arXiv Detail & Related papers (2024-06-18T06:37:26Z) - Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z) - CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion [86.01508183157613]
CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages.
We show that CrossCodeEval is extremely challenging when the relevant cross-file context is absent.
We also show that CrossCodeEval can be used to measure the capability of code retrievers.
arXiv Detail & Related papers (2023-10-17T13:18:01Z) - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
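As a schematic reading of RepoCoder's iterative retrieval-and-generation idea, the loop below feeds each round's generated code back into the retrieval query so that later rounds can surface more relevant cross-file snippets. It reuses the hypothetical retrieve, build_prompt, and complete_next_line helpers from the earlier sketch and is not the framework's released implementation.

```python
# Illustrative iterative retrieval-and-generation loop in the spirit of
# RepoCoder; relies on the hypothetical helpers defined in the sketch above.
def iterative_complete(example: dict, complete_next_line, rounds: int = 2) -> str:
    query = example["in_file_context"]
    prediction = ""
    for _ in range(rounds):
        # Retrieve cross-file snippets with the current query.
        cross_file = retrieve(query, example["candidate_snippets"])
        prediction = complete_next_line(build_prompt(cross_file, example["in_file_context"]))
        # Augment the retrieval query with the latest prediction for the next round.
        query = example["in_file_context"] + "\n" + prediction
    return prediction
```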