ARKS: Active Retrieval in Knowledge Soup for Code Generation
- URL: http://arxiv.org/abs/2402.12317v1
- Date: Mon, 19 Feb 2024 17:37:28 GMT
- Title: ARKS: Active Retrieval in Knowledge Soup for Code Generation
- Authors: Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu,
Qian Liu, Tao Yu
- Abstract summary: We introduce Active Retrieval in Knowledge Soup (ARKS), an advanced strategy for generalizing large language models for code.
We employ an active retrieval strategy that iteratively refines the query and updates the knowledge soup.
Experimental results on ChatGPT and CodeLlama demonstrate that ARKS substantially improves the average execution accuracy of these LLMs.
- Score: 18.22108704150575
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently the retrieval-augmented generation (RAG) paradigm has raised much
attention for its potential in incorporating external knowledge into large
language models (LLMs) without further training. While widely explored in
natural language applications, its utilization in code generation remains
under-explored. In this paper, we introduce Active Retrieval in Knowledge Soup
(ARKS), an advanced strategy for generalizing large language models for code.
In contrast to relying on a single source, we construct a knowledge soup
integrating web search, documentation, execution feedback, and evolved code
snippets. We employ an active retrieval strategy that iteratively refines the
query and updates the knowledge soup. To assess the performance of ARKS, we
compile a new benchmark comprising realistic coding problems associated with
frequently updated libraries and long-tail programming languages. Experimental
results on ChatGPT and CodeLlama demonstrate that ARKS substantially improves
the average execution accuracy of both LLMs. The analysis confirms the
effectiveness of our proposed knowledge soup and active retrieval strategies,
offering rich insights into the construction of effective retrieval-augmented
code generation (RACG) pipelines. Our model, code, and data are available at
https://arks-codegen.github.io.
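The abstract above outlines an iterative pipeline: retrieve from a heterogeneous "knowledge soup", generate code, collect execution feedback, then refine the query and update the soup before retrieving again. The following is a minimal illustrative sketch of that loop under stated assumptions; it is not the authors' released code, and every name in it (KnowledgeSoup, retrieve, generate_code, run_tests, refine_query) is a hypothetical placeholder to be swapped for a real retriever, LLM, and execution sandbox.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeSoup:
    """Heterogeneous knowledge pool: web results, docs, feedback, code snippets."""
    entries: list = field(default_factory=list)

    def update(self, new_entries):
        # Deduplicate while preserving insertion order.
        for entry in new_entries:
            if entry not in self.entries:
                self.entries.append(entry)


def retrieve(query, soup, k=3):
    """Hypothetical retriever: rank soup entries by naive word overlap."""
    words = set(query.lower().split())
    ranked = sorted(soup.entries,
                    key=lambda e: len(words & set(e.lower().split())),
                    reverse=True)
    return ranked[:k]


def generate_code(question, context):
    """Hypothetical LLM call (e.g. ChatGPT or CodeLlama behind an API)."""
    return f"# candidate solution conditioned on {len(context)} context chunks"


def run_tests(code):
    """Hypothetical executor returning (passed, execution feedback)."""
    return False, "NameError: name 'parse_config' is not defined"


def refine_query(question, code, feedback):
    """Hypothetical query rewriter for the next retrieval round."""
    return f"{question} | previous attempt failed with: {feedback}"


def active_retrieval_loop(question, soup, max_rounds=3):
    query, code = question, ""
    for _ in range(max_rounds):
        context = retrieve(query, soup)          # retrieve from the soup
        code = generate_code(question, context)  # draft a solution
        passed, feedback = run_tests(code)       # execution feedback
        if passed:
            break
        # Active step: refine the query and grow the soup with new evidence.
        query = refine_query(question, code, feedback)
        soup.update([feedback, code])
    return code


if __name__ == "__main__":
    soup = KnowledgeSoup(["documentation excerpt for the updated library",
                          "web search result about the new API"])
    print(active_retrieval_loop("parse a config file with the updated API", soup))
```

In this sketch the soup grows each round with the execution feedback and the model's own evolved snippet, mirroring the abstract's idea of querying an updated knowledge pool rather than a fixed corpus.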
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report [3.4632900249241874]
This paper presents an experience report on the development of Retrieval Augmented Generation (RAG) systems using PDF documents as the primary data source.
The RAG architecture combines generative capabilities of Large Language Models (LLMs) with the precision of information retrieval.
The practical implications of this research lie in enhancing the reliability of generative AI systems in various sectors.
arXiv Detail & Related papers (2024-10-21T12:21:49Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- SelfEvolve: A Code Evolution Framework via Large Language Models [5.6607714367826105]
Large language models (LLMs) have already revolutionized code generation, after being pretrained on publicly available code data.
We propose a novel two-step pipeline, called SelfEvolve, that leverages LLMs as both knowledge providers and self-reflective programmers.
We evaluate SelfEvolve on three code generation datasets, including DS-1000 for data science code, HumanEval for software engineering code, and TransCoder for C++-to-Python translation.
arXiv Detail & Related papers (2023-06-05T14:12:46Z)
- Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy [164.83371924650294]
We show that strong performance can be achieved by a method we call Iter-RetGen, which synergizes retrieval and generation in an iterative manner.
A model output shows what might be needed to finish a task, and thus provides an informative context for retrieving more relevant knowledge.
Iter-RetGen processes all retrieved knowledge as a whole and largely preserves the flexibility in generation without structural constraints.
arXiv Detail & Related papers (2023-05-24T16:17:36Z)
- Active Retrieval Augmented Generation [123.68874416084499]
Augmenting language models (LMs) by retrieving information from external knowledge resources is one promising solution.
Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on the input.
We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic method which iteratively uses a prediction of the upcoming sentence to anticipate future content. A minimal illustrative sketch of this forward-looking loop appears after this list.
arXiv Detail & Related papers (2023-05-11T17:13:40Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning [18.354352985591305]
Code summarization generates a brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query.
Recent studies have combined these two tasks to improve their performance.
We propose a novel end-to-end model for the two tasks by introducing an additional code generation task.
arXiv Detail & Related papers (2020-02-24T12:26:11Z)
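Several entries above (notably FLARE and Iter-RetGen) interleave retrieval with generation rather than retrieving once up front. Below is a minimal illustrative sketch of a forward-looking loop in the spirit of FLARE: the model drafts the next sentence, and if the draft is low-confidence, the draft itself becomes the retrieval query before the sentence is regenerated. This is a sketch under stated assumptions, not the authors' implementation; every function and threshold here is a hypothetical placeholder.

```python
def draft_next_sentence(prompt):
    """Hypothetical LM call returning a tentative next sentence and a confidence."""
    return "The library was released in an unknown year.", 0.4


def retrieve(query, corpus, k=2):
    """Hypothetical retriever: rank passages by naive word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(words & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]


def regenerate_with_evidence(prompt, evidence):
    """Hypothetical LM call that re-drafts the sentence conditioned on evidence."""
    return f"Sentence grounded in {len(evidence)} retrieved passages."


def forward_looking_generate(question, corpus, max_sentences=3, threshold=0.7):
    answer = ""
    for _ in range(max_sentences):
        draft, confidence = draft_next_sentence(question + answer)
        if confidence < threshold:
            # Forward-looking step: the tentative draft itself becomes the query.
            evidence = retrieve(draft, corpus)
            sentence = regenerate_with_evidence(question + answer, evidence)
        else:
            sentence = draft
        answer += " " + sentence
    return answer.strip()


if __name__ == "__main__":
    passages = ["Release notes: version 2.0 shipped in 2023.",
                "Changelog: the parser API was rewritten in the latest release."]
    print(forward_looking_generate("When was the library released?", passages))
```

The design choice illustrated here is that retrieval is triggered only when the model's own draft looks uncertain, so the retrieval query reflects what the model is about to say rather than only the original input.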