Retrieval Augmented Code Generation and Summarization
- URL: http://arxiv.org/abs/2108.11601v1
- Date: Thu, 26 Aug 2021 06:48:13 GMT
- Title: Retrieval Augmented Code Generation and Summarization
- Authors: Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray,
Kai-Wei Chang
- Abstract summary: We propose a retrieval augmented framework, tool, that retrieves relevant code or summaries from a retrieval database.
tool extends the state-of-the-art dense retrieval technique to search for relevant code or summaries.
We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python.
- Score: 43.823483197436076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software developers write a lot of source code and documentation during
software development. Intrinsically, developers often recall parts of source
code or code summaries that they had written in the past while implementing
software or documenting them. To mimic developers' code or summary generation
behavior, we propose a retrieval augmented framework, \tool, that retrieves
relevant code or summaries from a retrieval database and provides them as a
supplement to code generation or summarization models. \tool has a couple of
uniqueness. First, it extends the state-of-the-art dense retrieval technique to
search for relevant code or summaries. Second, it can work with retrieval
databases that include unimodal (only code or natural language description) or
bimodal instances (code-description pairs). We conduct experiments and
extensive analysis on two benchmark datasets of code generation and
summarization in Java and Python, and the promising results endorse the
effectiveness of our proposed retrieval augmented framework.
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z) - Building A Coding Assistant via the Retrieval-Augmented Language Model [24.654428111628242]
We propose a retrieval-augmeNted language model (CONAN) to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding.
It consists of a code structure aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G)
arXiv Detail & Related papers (2024-10-21T17:34:39Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework.
Inspired by the human retrieval process - sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-12-20T23:49:37Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - AugmentedCode: Examining the Effects of Natural Language Resources in
Code Retrieval Models [5.112140303263898]
We introduce Augmented Code (AugmentedCode) retrieval which takes advantage of existing information within the code.
We showcased the the results of augmented programming language which outperforms on CodeSearchNet and CodeBERT with a Mean Reciprocal Rank (MRR) of 0.73 and 0.96.
arXiv Detail & Related papers (2021-10-16T08:44:48Z) - Leveraging Code Generation to Improve Code Retrieval and Summarization
via Dual Learning [18.354352985591305]
Code summarization generates brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query.
Recent studies have combined these two tasks to improve their performance.
We propose a novel end-to-end model for the two tasks by introducing an additional code generation task.
arXiv Detail & Related papers (2020-02-24T12:26:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.