Preference-Guided Refactored Tuning for Retrieval Augmented Code Generation
- URL: http://arxiv.org/abs/2409.15895v1
- Date: Tue, 24 Sep 2024 09:15:37 GMT
- Title: Preference-Guided Refactored Tuning for Retrieval Augmented Code Generation
- Authors: Xinyu Gao, Yun Xiong, Deze Wang, Zhenhan Guan, Zejian Shi, Haofen Wang, Shanshan Li,
- Abstract summary: We propose RRG (Retrieve, Refactor, Generate), a novel framework for effective and efficient code generation.
This framework introduces a code refactorer module between the retriever and the generator to bridge them.
RRG achieved significant performance improvements, with increases of up to 28% on EM, 13% on BLEU, and 6.8% on CodeBLEU.
- Score: 10.736876118242384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-augmented code generation utilizes Large Language Models as the generator and significantly expands their code generation capabilities by providing relevant code, documentation, and more via the retriever. Current approaches suffer from two primary limitations: 1) information redundancy. The indiscriminate inclusion of redundant information can waste resources and may misguide generators, hurting their effectiveness and efficiency. 2) preference gap. Because of their different optimization objectives, the retriever strives to procure code with higher ground-truth similarity, yet this effort does not substantially benefit the generator. The retriever and the generator may prefer different golden code, and this gap in preference results in a suboptimal design. Additionally, differences in the parametric knowledge acquired during pre-training result in varying preferences among generators. To address these limitations, in this paper, we propose RRG (Retrieve, Refactor, Generate), a novel framework for effective and efficient code generation. This framework introduces a code refactorer module between the retriever and the generator to bridge them. The refactoring process transforms the raw retrieved code into a more concise, efficient, and model-friendly version. It eliminates redundant information and noise, reducing the input length. Consequently, the generator receives higher-quality context, enabling it to produce more accurate results at lower inference cost. We conducted comprehensive experiments on multiple datasets, confirmed the existence of a preference gap between the retriever and the generator, and showed that RRG effectively bridges this gap. Specifically, RRG achieved significant performance improvements, with increases of up to 28% on EM, 13% on BLEU, and 6.8% on CodeBLEU.
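To make the pipeline concrete, here is a minimal sketch of the Retrieve, Refactor, Generate flow described in the abstract. Every function name, prompt, and the `llm_complete` placeholder is an illustrative assumption, not the paper's actual interface:

```python
# Hypothetical sketch of the RRG (Retrieve, Refactor, Generate) pipeline.
# Names, prompts, and the llm_complete placeholder are assumptions.

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to any code-capable LLM."""
    raise NotImplementedError

def retrieve(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """Toy retriever: rank corpus snippets by token overlap with the query."""
    def overlap(snippet: str) -> int:
        return len(set(query.split()) & set(snippet.split()))
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

def refactor(query: str, snippets: list[str]) -> str:
    """Refactorer: rewrite the raw retrieved code into a concise,
    generator-friendly context, dropping redundant or noisy parts."""
    prompt = ("Rewrite the following retrieved code so it stays correct "
              f"but keeps only what helps solve: {query}\n\n"
              + "\n---\n".join(snippets))
    return llm_complete(prompt)

def generate(query: str, context: str) -> str:
    """Generator: produce code conditioned on the refactored context."""
    return llm_complete(f"Context:\n{context}\n\nTask: {query}\nCode:")

def rrg(query: str, corpus: list[str]) -> str:
    return generate(query, refactor(query, retrieve(query, corpus)))
```

The sketch only captures the inference-time flow; per the title, the refactorer itself is tuned with preference guidance so its outputs suit the downstream generator.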
Related papers
- FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG [22.4664221738095]
Retrieval-Augmented Generation (RAG) prevails in Large Language Models.
We propose a progressive retrieval paradigm with coarse-to-fine granularity for RAG.
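As a rough picture of coarse-to-fine retrieval, here is a minimal sketch assuming two stages (whole documents, then sentences) and a toy overlap scorer; the granularities and cut-offs are invented for the example, not FunnelRAG's actual design:

```python
# Illustrative coarse-to-fine retrieval funnel (not FunnelRAG's real code):
# stage 1 ranks whole documents cheaply, stage 2 re-ranks only the
# passages inside the survivors at finer granularity.

def score(query: str, text: str) -> int:
    """Toy relevance score (token overlap); a real funnel would use a
    cheap retriever coarsely and a stronger re-ranker at the fine stage."""
    return len(set(query.split()) & set(text.split()))

def funnel_retrieve(query: str, documents: list[str],
                    coarse_k: int = 10, fine_k: int = 3) -> list[str]:
    # Coarse stage: keep the top-k whole documents.
    coarse = sorted(documents, key=lambda d: score(query, d),
                    reverse=True)[:coarse_k]
    # Fine stage: split survivors into sentences and re-rank those.
    passages = [p for doc in coarse for p in doc.split(". ") if p]
    return sorted(passages, key=lambda p: score(query, p),
                  reverse=True)[:fine_k]
```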
arXiv Detail & Related papers (2024-10-14T08:47:21Z)
- RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation [54.707460684650584]
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention, yet their knowledge is fixed at training time and they can hallucinate.
Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval-Augmented Generation (RAG).
RAGLAB is a modular and research-oriented open-source library that reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms.
arXiv Detail & Related papers (2024-08-21T07:20:48Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
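The one-versus-multiple-source setting can be illustrated with a small helper that assembles retrieved snippets into a single generation prompt; the layout and names below are assumptions, not CodeRAG-Bench's actual harness:

```python
# Assemble contexts retrieved from several sources into one prompt.
# Formatting and source names are assumptions for illustration.

def build_prompt(task: str, sources: dict[str, list[str]]) -> str:
    """Concatenate snippets from each source, then append the task."""
    blocks = [f"# Retrieved from {name}:\n" + "\n".join(snippets)
              for name, snippets in sources.items()]
    return "\n\n".join(blocks) + f"\n\n# Task:\n{task}\n"
```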
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Prompt Optimization via Adversarial In-Context Learning [51.18075178593142]
adv-ICL is implemented as a two-player game between a generator and a discriminator.
The generator tries to produce output realistic enough to fool the discriminator.
We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques.
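One round of the game might be sketched as follows, with `call_llm`, both prompts, and the revise-on-failure rule all being assumptions rather than the paper's exact procedure:

```python
# Hypothetical skeleton of one adversarial in-context-learning round:
# the generator prompt is revised whenever the discriminator is not fooled.
import random

def call_llm(prompt: str) -> str:
    """Placeholder for any instruction-following LLM."""
    raise NotImplementedError

def adv_icl_round(gen_prompt: str, real_examples: list[str]) -> str:
    fake = call_llm(gen_prompt)                  # generator's move
    real = random.choice(real_examples)
    verdict = call_llm(                          # discriminator's move
        "Which of these two texts is human-written?\n"
        f"A: {fake}\nB: {real}\nAnswer A or B."
    )
    if verdict.strip().startswith("B"):
        # Discriminator was not fooled: revise the generator's prompt.
        gen_prompt = call_llm(
            "Improve this prompt so its outputs look more human-written:\n"
            f"{gen_prompt}"
        )
    return gen_prompt
```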
arXiv Detail & Related papers (2023-12-05T09:44:45Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on the transformed programs improves performance by up to 30% compared to fine-tuning on the original dataset.
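A hedged sketch of such a cleaning pass, assuming a single LLM rewrite per program; the instructions and helper are invented, and a real pipeline would also verify that each rewrite preserves behavior:

```python
# Sketch of an LLM-based data-cleaning pass over a training corpus.
# The instructions and call_llm helper are assumptions for illustration.

CLEANING_INSTRUCTIONS = (
    "Rewrite this program with descriptive variable names, helper "
    "functions for repeated logic, and brief comments, without "
    "changing its behavior."
)

def call_llm(prompt: str) -> str:
    """Placeholder for a code-capable LLM."""
    raise NotImplementedError

def clean_corpus(programs: list[str]) -> list[str]:
    cleaned = []
    for program in programs:
        rewritten = call_llm(f"{CLEANING_INSTRUCTIONS}\n\n{program}")
        # A real pipeline would verify functional equivalence here
        # before keeping the rewritten program.
        cleaned.append(rewritten)
    return cleaned
```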
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- MGR: Multi-generator Based Rationalization [14.745836934156427]
Rationalization employs a generator and a predictor to construct a self-explaining NLP model: the generator selects a rationale from the input, and the predictor makes its prediction from that rationale alone.
In this paper, we propose a simple yet effective method named MGR to simultaneously solve the two problems.
We show that MGR improves the F1 score by up to 20.9% as compared to state-of-the-art methods.
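A toy rendering of the multi-generator setup, in which every piece (the random selectors, the keyword predictor) is a deliberate simplification of the trained components:

```python
# Toy multi-generator rationalization: several "generators" each propose
# a token mask, and the predictor sees only the selected tokens.
import random

def make_generator(seed: int):
    """Stand-in selector that keeps roughly half of the tokens."""
    rng = random.Random(seed)
    def generator(tokens: list[str]) -> list[bool]:
        return [rng.random() < 0.5 for _ in tokens]
    return generator

def predict(rationale: list[str]) -> str:
    """Placeholder predictor; a trained classifier in the real method."""
    return "positive" if "good" in rationale else "negative"

def mgr_forward(text: str, n_generators: int = 3) -> list[str]:
    tokens = text.split()
    predictions = []
    for seed in range(n_generators):
        mask = make_generator(seed)(tokens)
        rationale = [t for t, keep in zip(tokens, mask) if keep]
        predictions.append(predict(rationale))
    return predictions
```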
arXiv Detail & Related papers (2023-05-08T06:36:46Z)
- Joint Generator-Ranker Learning for Natural Language Generation [99.16268050116717]
JGR is a novel joint training algorithm that integrates the generator and the ranker in a single framework.
By iteratively updating the generator and the ranker, JGR can effectively harmonize their learning and enhance their quality jointly.
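The alternation can be summarized as below; the `Generator` and `Ranker` interfaces and the update signals are our assumptions about the general shape of such a loop, not JGR's exact objectives:

```python
# Conceptual alternation for joint generator-ranker training.
# The interfaces and update signals below are illustrative assumptions.
from typing import Protocol

class Generator(Protocol):
    def sample(self, source: str, n: int) -> list[str]: ...
    def update(self, source: str, candidate: str, reward: float) -> None: ...

class Ranker(Protocol):
    def score(self, source: str, candidate: str) -> float: ...
    def update(self, source: str, positive: str,
               negatives: list[str]) -> None: ...

def jgr_step(generator: Generator, ranker: Ranker,
             batch: list[tuple[str, str]]) -> None:
    for source, target in batch:
        candidates = generator.sample(source, n=4)
        scores = [ranker.score(source, c) for c in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)
        # Ranker learns to prefer the reference over the generator's
        # samples; the generator learns from the ranker's reward.
        ranker.update(source, positive=target, negatives=candidates)
        generator.update(source, candidate=candidates[best],
                         reward=scores[best])
```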
arXiv Detail & Related papers (2022-06-28T12:58:30Z)
- Highly Parallel Autoregressive Entity Linking with Discriminative Correction [51.947280241185]
We propose a very efficient approach that parallelizes autoregressive linking across all potential mentions.
Our model is >70 times faster and more accurate than the previous generative method.
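The parallelization idea can be sketched as scoring each mention independently and then applying a discriminative correction to the shortlist; the knowledge base and scorers below are stand-ins for illustration:

```python
# Toy parallel linking: mentions are scored independently (hence easy
# to parallelize), then a "corrector" re-ranks the top candidates.
from concurrent.futures import ThreadPoolExecutor

KB = {
    "paris": ["Paris_(France)", "Paris_(Texas)"],
    "jordan": ["Michael_Jordan", "Jordan_(country)"],
}

def generative_score(mention: str, entity: str) -> float:
    """Stand-in for the autoregressive model's candidate score."""
    return float(len(set(mention) & set(entity.lower())))

def link(mention: str) -> str:
    candidates = KB.get(mention, [])
    if not candidates:
        return ""
    top = sorted(candidates, key=lambda e: generative_score(mention, e),
                 reverse=True)[:2]
    # Discriminative correction: re-rank the shortlist (a trained
    # classifier in the real method; the toy scorer is reused here).
    return max(top, key=lambda e: generative_score(mention, e))

def link_all(mentions: list[str]) -> list[str]:
    with ThreadPoolExecutor() as pool:  # all mentions linked in parallel
        return list(pool.map(link, mentions))
```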
arXiv Detail & Related papers (2021-09-08T17:28:26Z)
- Improving GANs for Speech Enhancement [19.836041050328102]
We propose chaining multiple generators to perform multi-stage enhancement mapping.
We demonstrate that the proposed multi-stage enhancement approach outperforms the one-stage SEGAN baseline.
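Chaining can be illustrated with toy numeric stages, each refining the previous stage's output; the smoothing function below stands in for a trained generator network:

```python
# Minimal sketch of chained multi-stage enhancement: stage i's output
# is stage i+1's input. The smoother is a toy stand-in for a network.

def make_stage(strength: float):
    """Build a toy 'generator': shrink each sample toward the mean
    of its neighbours (a crude denoiser)."""
    def stage(signal: list[float]) -> list[float]:
        out = []
        for i, x in enumerate(signal):
            left = signal[i - 1] if i > 0 else x
            right = signal[i + 1] if i < len(signal) - 1 else x
            out.append((1 - strength) * x + strength * (left + right) / 2)
        return out
    return stage

def multi_stage_enhance(signal: list[float],
                        n_stages: int = 3) -> list[float]:
    for stage in [make_stage(0.3) for _ in range(n_stages)]:
        signal = stage(signal)  # stage i's output feeds stage i + 1
    return signal
```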
arXiv Detail & Related papers (2020-01-15T19:57:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.