CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion
- URL: http://arxiv.org/abs/2509.16112v1
- Date: Fri, 19 Sep 2025 15:57:40 GMT
- Title: CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion
- Authors: Sheng Zhang, Yifan Ding, Shuquan Lian, Shun Song, Hui Li,
- Abstract summary: Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository.<n>CodeRAG is a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion.
- Score: 11.329578913209623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository. Recent strides in Code Large Language Models (code LLMs) have spurred the development of repository-level code completion methods, yielding promising results. Nevertheless, they suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between code retriever and code LLM. To address these problems, we introduce CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. Its core components include log probability guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval demonstrate that CodeRAG significantly and consistently outperforms state-of-the-art methods. The implementation of CodeRAG is available at https://github.com/KDEGroup/CodeRAG.
Related papers
- AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion [55.21541958868449]
We propose AlignCoder, a repository-level code completion framework.<n>Our framework generates an enhanced query that bridges the semantic gap between the initial query and the target code.<n>We employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval.
arXiv Detail & Related papers (2026-01-27T15:23:14Z) - CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [69.684886175768]
Large language models (LLMs) have shown promising performance in automated code generation.<n>In this paper, we propose CodeRAG, a retrieval-augmented code generation framework.<n> Experiments show that CodeRAG achieves significant improvements compared to no RAG scenarios.
arXiv Detail & Related papers (2025-04-14T09:51:23Z) - Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs [24.00351065427465]
We propose a strategy named Hierarchical Context Pruning (HCP) to construct completion prompts with high informational code content.
The HCP models the code repository at the function level, maintaining the topological dependencies between code files while removing a large amount of irrelevant code content.
arXiv Detail & Related papers (2024-06-26T12:26:16Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.<n>We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.<n>We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model [30.625128161499195]
GraphCoder is a retrieval-augmented code completion framework.
It uses general code knowledge and the repository-specific knowledge via a graph-based retrieval-generation process.
It achieves higher exact match (EM) on average, with increases of +6.06 in code match and +6.23 in identifier match, while using less time and space.
arXiv Detail & Related papers (2024-06-11T06:55:32Z) - R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [62.20537000942005]
We propose the R2C2-Coder to enhance and benchmark the real-world repository-level code completion abilities of code Large Language Models.<n>R2C2-Coder includes a code prompt construction method R2C2-Enhance and a well-designed benchmark R2C2-Bench.
arXiv Detail & Related papers (2024-06-03T14:24:29Z) - RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion [12.173834895070827]
tool is a framework designed to address the complex challenges associated with repository-level code completion.
Central to tool is the em Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories.
Our evaluations show that tool markedly outperforms existing techniques in repository-level code completion.
arXiv Detail & Related papers (2024-03-10T05:10:34Z) - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval
and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.