RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- URL: http://arxiv.org/abs/2306.03091v2
- Date: Wed, 4 Oct 2023 01:13:49 GMT
- Title: RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- Authors: Tianyang Liu, Canwen Xu, Julian McAuley
- Abstract summary: RepoBench is a benchmark for evaluating code auto-completion systems.
It consists of three evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline).
- Score: 43.797002322559834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have greatly advanced code auto-completion
systems, with a potential for substantial productivity enhancements for
developers. However, current benchmarks mainly focus on single-file tasks,
leaving an assessment gap for more complex, real-world, multi-file programming
scenarios. To fill this gap, we introduce RepoBench, a new benchmark
specifically designed for evaluating repository-level code auto-completion
systems. RepoBench supports both Python and Java and consists of three
interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code
Completion), and RepoBench-P (Pipeline). Each task respectively measures the
system's ability to retrieve the most relevant code snippets from other files
as cross-file context, predict the next line of code with cross-file and
in-file context, and handle complex tasks that require a combination of both
retrieval and next-line prediction. RepoBench aims to facilitate a more
complete comparison of performance and encourage continuous improvement in
auto-completion systems. RepoBench is publicly available at
https://github.com/Leolty/repobench.
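The three tasks described in the abstract can be illustrated with a minimal sketch. This is not RepoBench's implementation: the Jaccard token-overlap retriever, the exact-match scorer, and every function name below are assumptions chosen for illustration. `retrieve` stands in for the retrieval task, `exact_match` for scoring a next-line prediction, and `build_prompt` for the pipeline step that joins retrieved cross-file context to the in-file context.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split code into a set of identifier and number tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(in_file_context: str, candidates: list[str]) -> int:
    """Return the index of the candidate snippet most similar to the context."""
    scores = [jaccard(in_file_context, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def build_prompt(in_file_context: str, candidates: list[str]) -> str:
    """Pipeline step: prepend the retrieved snippet to the in-file context."""
    best = candidates[retrieve(in_file_context, candidates)]
    return best + "\n" + in_file_context


def exact_match(predicted_line: str, reference_line: str) -> bool:
    """Score a next-line prediction, ignoring surrounding whitespace."""
    return predicted_line.strip() == reference_line.strip()


# Hypothetical example data: one in-file context and two cross-file snippets.
context = "from utils import load_config\ncfg = load_config(path)\n"
snippets = [
    "def load_config(path):\n    return parse(path)",
    "class Logger:\n    def log(self, msg): ...",
]
best_index = retrieve(context, snippets)
prompt = build_prompt(context, snippets)
```

In this sketch the first snippet is selected because it shares the tokens `load_config` and `path` with the context. A code model would then receive `prompt` and its predicted next line would be compared against the reference with `exact_match`.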
Related papers
- ExecRepoBench: Multi-level Executable Code Completion Evaluation [45.963424627710765]
We introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark ExecRepoBench.
We present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units.
Then, we fine-tune the open-source LLM with 7B parameters on Repo-Instruct to produce a strong code completion baseline model Qwen2.5-Coder-Instruct-C.
arXiv Detail & Related papers (2024-12-16T17:14:35Z)
- PyBench: Evaluating LLM Agent on various real-world coding tasks [13.347173063163138]
PyBench is a benchmark spanning five main categories of real-world tasks and more than 10 types of files.
Our evaluations indicate that current open-source LLMs are struggling with these tasks.
Our fine-tuned 8B model, PyLlama3, achieves an exciting performance on PyBench.
arXiv Detail & Related papers (2024-07-23T15:23:14Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation.
We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z)
- Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z)
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion [86.01508183157613]
CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages.
We show that CrossCodeEval is extremely challenging when the relevant cross-file context is absent.
We also show that CrossCodeEval can be used to measure the capability of code retrievers.
arXiv Detail & Related papers (2023-10-17T13:18:01Z)
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)