Repository-Level Prompt Generation for Large Language Models of Code
- URL: http://arxiv.org/abs/2206.12839v3
- Date: Mon, 5 Jun 2023 18:43:50 GMT
- Title: Repository-Level Prompt Generation for Large Language Models of Code
- Authors: Disha Shrivastava, Hugo Larochelle, Daniel Tarlow
- Abstract summary: We propose a framework that learns to generate example-specific prompts using prompt proposals.
The prompt proposals take context from the entire repository.
We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives.
- Score: 28.98699307030983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of large language models (LLMs) of code and their use as
code assistants (e.g. Codex used in GitHub Copilot), techniques for introducing
domain-specific knowledge in the prompt design process become important. In
this work, we propose a framework called Repo-Level Prompt Generator that
learns to generate example-specific prompts using prompt proposals. The prompt
proposals take context from the entire repository, thereby incorporating both
the structure of the repository and the context from other relevant files (e.g.
imports, parent class files). Our technique doesn't require any access to the
weights of the LLM, making it applicable in cases where we only have black-box
access to the LLM. We conduct experiments on the task of single-line
code-autocompletion using code repositories taken from Google Code archives. We
demonstrate that an oracle constructed from our prompt proposals gives a
remarkably high relative improvement of 36% over Codex, showing the quality of
these proposals. Further, we show that when we train a model to predict a
prompt proposal, we can achieve significant performance gains over Codex and
other baselines. We release our code, data, and trained checkpoints at:
\url{https://github.com/shrivastavadisha/repo_level_prompt_generation}.
Related papers
- AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion [55.21541958868449]
We propose AlignCoder, a repository-level code completion framework.<n>Our framework generates an enhanced query that bridges the semantic gap between the initial query and the target code.<n>We employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval.
arXiv Detail & Related papers (2026-01-27T15:23:14Z) - Uncovering Code Insights: Leveraging GitHub Artifacts for Deeper Code Understanding [0.1358202049520503]
Large language models (LLMs) have shown promise in generating code explanations.<n>We propose a novel approach that leverages natural language artifacts from GitHub.<n>Our system consists of three components: one that extracts and structures relevant GitHub context, another that uses this context to generate high-level explanations of the code's purpose, and a third that validates the explanation.
arXiv Detail & Related papers (2025-11-05T15:31:42Z) - CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion [11.329578913209623]
Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository.<n>CodeRAG is a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion.
arXiv Detail & Related papers (2025-09-19T15:57:40Z) - CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [69.684886175768]
Large language models (LLMs) have shown promising performance in automated code generation.
In this paper, we propose CodeRAG, a retrieval-augmented code generation framework.
Experiments show that CodeRAG achieves significant improvements compared to no RAG scenarios.
arXiv Detail & Related papers (2025-04-14T09:51:23Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z) - A Prompt Learning Framework for Source Code Summarization [24.33455799484519]
We propose a novel prompt learning framework for code summarization called PromptCS.
PromptCS trains a prompt agent that can generate continuous prompts to unleash the potential for LLMs in code summarization.
We evaluate PromptCS on the CodeSearchNet dataset involving multiple programming languages.
arXiv Detail & Related papers (2023-12-26T14:37:55Z) - A Review of Repository Level Prompting for LLMs [0.0]
Large Language Models (LLMs) have led to notable successes, such as achieving a 94.6% solve rate on the HumanEval benchmark.
There is an increasing commercial push for repository-level inline code completion tools, such as GitHub Copilot and Tab Nine.
This paper delves into the transition from individual coding problems to repository-scale solutions.
arXiv Detail & Related papers (2023-12-15T00:34:52Z) - Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z) - LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction.
memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
arXiv Detail & Related papers (2023-06-26T17:59:24Z) - RepoFusion: Training Code Models to Understand Your Repository [12.621282610983592]
Large Language Models (LLMs) in coding assistants like GitHub Copilot struggle to understand the context present in the repository.
Recent work has shown the promise of using context from the repository during inference.
We propose RepoFusion, a framework to train models to incorporate relevant repository context.
arXiv Detail & Related papers (2023-06-19T15:05:31Z) - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval
and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z) - Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework.
Inspired by the human retrieval process - sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-12-20T23:49:37Z) - CoreGen: Contextualized Code Representation Learning for Commit Message
Generation [39.383390029545865]
We propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen)
Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score.
arXiv Detail & Related papers (2020-07-14T09:43:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.