ARKS: Active Retrieval in Knowledge Soup for Code Generation
- URL: http://arxiv.org/abs/2402.12317v1
- Date: Mon, 19 Feb 2024 17:37:28 GMT
- Title: ARKS: Active Retrieval in Knowledge Soup for Code Generation
- Authors: Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu,
Qian Liu, Tao Yu
- Abstract summary: We introduce Active Retrieval in Knowledge Soup (ARKS), an advanced strategy for generalizing large language models for code.
We employ an active retrieval strategy that iteratively refines the query and updates the knowledge soup.
Experimental results on ChatGPT and CodeLlama show that ARKS substantially improves the average execution accuracy of LLMs.
- Score: 18.22108704150575
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently the retrieval-augmented generation (RAG) paradigm has raised much
attention for its potential in incorporating external knowledge into large
language models (LLMs) without further training. While widely explored in
natural language applications, its utilization in code generation remains
under-explored. In this paper, we introduce Active Retrieval in Knowledge Soup
(ARKS), an advanced strategy for generalizing large language models for code.
In contrast to relying on a single source, we construct a knowledge soup
integrating web search, documentation, execution feedback, and evolved code
snippets. We employ an active retrieval strategy that iteratively refines the
query and updates the knowledge soup. To assess the performance of ARKS, we
compile a new benchmark comprising realistic coding problems associated with
frequently updated libraries and long-tail programming languages. Experimental
results on ChatGPT and CodeLlama show that ARKS substantially improves the
average execution accuracy of LLMs. The analysis confirms the
effectiveness of our proposed knowledge soup and active retrieval strategies,
offering rich insights into the construction of effective retrieval-augmented
code generation (RACG) pipelines. Our model, code, and data are available at
https://arks-codegen.github.io.
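The abstract describes the pipeline only at a high level. As a rough illustration of the described loop, here is a minimal Python sketch of an ARKS-style active retrieval cycle; the `llm`, `retrieve`, and `run_tests` callables are hypothetical stand-ins, not the authors' released code:

```python
# Minimal sketch of an ARKS-style active retrieval loop. All callables are
# hypothetical placeholders, not the authors' released implementation.
from typing import Callable, List

def arks_loop(
    task: str,
    llm: Callable[[str], str],              # prompt -> completion
    retrieve: Callable[[str], List[str]],   # query -> knowledge snippets
    run_tests: Callable[[str], str],        # code -> feedback ("" if passing)
    max_iters: int = 4,
) -> str:
    """Iteratively refine the query and grow the knowledge soup until the
    generated code passes its tests or the iteration budget runs out."""
    soup: List[str] = []   # web search + docs + execution feedback + evolved code
    query = task           # the initial retrieval query is the task itself
    for _ in range(max_iters):
        soup.extend(retrieve(query))               # update the knowledge soup
        context = "\n\n".join(soup[-20:])          # keep the context bounded
        code = llm(f"Knowledge:\n{context}\n\nTask:\n{task}\n\nCode:")
        feedback = run_tests(code)
        if not feedback:   # empty feedback == all tests pass
            return code
        soup.append(f"Execution feedback:\n{feedback}")  # feedback joins the soup
        soup.append(f"Previous attempt:\n{code}")        # evolved code snippet
        query = llm(f"Task: {task}\nError: {feedback}\nRewrite the search query:")
    return code
```

Execution feedback and failed attempts are folded back into the soup, and the LLM itself rewrites the retrieval query for the next round.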
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy [164.83371924650294]
We show that strong performance can be achieved by a method we call Iter-RetGen, which synergizes retrieval and generation in an iterative manner.
A model output shows what might be needed to finish a task, and thus provides an informative context for retrieving more relevant knowledge.
Iter-RetGen processes all retrieved knowledge as a whole and largely preserves the flexibility in generation without structural constraints.
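A minimal sketch of this retrieve-then-regenerate cycle (the `generate` and `search` callables below are hypothetical placeholders, not the Iter-RetGen release):

```python
# Minimal sketch of an Iter-RetGen-style loop: each iteration retrieves with
# the previous generation appended to the query, then regenerates.
from typing import Callable, List

def iter_retgen(question: str,
                generate: Callable[[str], str],
                search: Callable[[str], List[str]],
                iterations: int = 3) -> str:
    answer = ""
    for _ in range(iterations):
        # The previous output hints at what knowledge is still missing,
        # so it becomes part of the retrieval query.
        docs = search(f"{question} {answer}".strip())
        context = "\n".join(docs)
        answer = generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return answer
```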
arXiv Detail & Related papers (2023-05-24T16:17:36Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand the knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
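A minimal sketch of this kind of LLM-based query expansion (the `llm` and `search` callables are hypothetical placeholders; InteR's actual RM-LLM interaction is richer):

```python
# Minimal sketch of LLM-based query expansion for retrieval: the LLM writes a
# short knowledge passage for the query, which is appended before searching.
from typing import Callable, List

def expand_and_retrieve(query: str,
                        llm: Callable[[str], str],
                        search: Callable[[str], List[str]]) -> List[str]:
    knowledge = llm(f"Write a short passage of background knowledge about: {query}")
    return search(f"{query}\n{knowledge}")
```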
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
- Active Retrieval Augmented Generation [123.68874416084499]
Augmenting large language models (LMs) by retrieving information from external knowledge resources is one promising solution.
Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on the input.
We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic method which iteratively uses a prediction of the upcoming sentence to anticipate future content.
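A compressed sketch of the forward-looking idea (the confidence signal and all callables are hypothetical stand-ins; FLARE's actual triggering and query-formulation details are more involved):

```python
# Minimal sketch of FLARE-style forward-looking retrieval: draft the next
# sentence, and if the model is unconfident about it, retrieve with that
# draft as the query before committing it to the output.
from typing import Callable, List, Tuple

def flare_generate(prompt: str,
                   draft_sentence: Callable[[str], Tuple[str, float]],  # -> (sentence, min token prob)
                   search: Callable[[str], List[str]],
                   llm: Callable[[str], str],
                   threshold: float = 0.6,
                   max_sentences: int = 10) -> str:
    output = ""
    for _ in range(max_sentences):
        sentence, confidence = draft_sentence(prompt + output)
        if not sentence:
            break
        if confidence < threshold:
            # Low confidence: use the tentative sentence itself as the query,
            # then regenerate it with the retrieved evidence in context.
            evidence = "\n".join(search(sentence))
            sentence = llm(f"Evidence:\n{evidence}\n\n{prompt}{output}")
        output += sentence
    return output
```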
arXiv Detail & Related papers (2023-05-11T17:13:40Z)
- REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models [11.78036105494679]
This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs).
We present the first-ever code search method that encodes dynamic information during training without the need to execute either the corpus under search or the search query at inference time.
arXiv Detail & Related papers (2023-05-05T20:46:56Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
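For orientation, here is the generic in-batch InfoNCE objective that contrastive code-search training typically builds on; this is a textbook formulation, not the paper's multimodal variant or its soft data augmentation:

```python
# Generic in-batch contrastive (InfoNCE) loss over query/code embedding pairs:
# row i of each batch is a positive pair; other rows serve as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """query_emb, code_emb: (batch, dim) tensors of paired embeddings."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature       # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in "embeddings":
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```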
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.