A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse
with Local-Aware, Global-Aware, and Third-Party-Library-Aware
- URL: http://arxiv.org/abs/2312.05772v4
- Date: Tue, 5 Mar 2024 08:52:59 GMT
- Authors: Dianshu Liao, Shidong Pan, Xiaoyu Sun, Xiaoxue Ren, Qing Huang,
Zhenchang Xing, Huan Jin, Qinying Li
- Abstract summary: We propose a novel code generation framework, dubbed A^3-CodGen, to harness information within the code repository to generate code with fewer potential logical errors.
We identify three categories of representative information for the code repository: local-aware information from the current code file, global-aware information from other code files, and third-party-library information.
Results demonstrate that by adopting the A3-CodGen framework, we successfully extract, fuse, and feed code repository information into the LLM, generating more accurate, efficient, and highly reusable code.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code generation tools are essential for helping developers in the
software development process. Existing tools are often disconnected from the
working context, i.e., the code repository, causing the generated code to
differ from what human developers would write. In this paper, we propose a
novel code generation framework, dubbed A^3-CodGen, to harness information
within the code repository to generate code with fewer potential logical
errors, less code redundancy, and fewer library-induced compatibility issues.
We identify three categories of representative information for the code
repository: local-aware information from the current code file, global-aware
information from other code files, and third-party-library information.
Results demonstrate that by adopting the A^3-CodGen framework, we successfully
extract, fuse, and feed code repository information into the LLM, generating
more accurate, efficient, and highly reusable code. The effectiveness of our
framework is further underscored by its generating code with a higher reuse
rate than human developers. This research contributes significantly to the
field of code generation, providing developers with a more powerful tool to
address the evolving demands of software development in practice.
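The three information categories above can be illustrated with a small sketch: gather local context from the current file, harvest reusable function signatures from other files as global context, list the available third-party libraries, and fuse everything into one prompt for the LLM. The function names and prompt layout here are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of local-, global-, and third-party-aware prompt fusion.
import ast


def extract_signatures(source: str) -> list[str]:
    """Collect 'def name(args)' signatures from a Python source file."""
    sigs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
    return sigs


def build_prompt(task: str, local_code: str, other_files: dict[str, str],
                 third_party: list[str]) -> str:
    """Fuse the three categories of repository information into one prompt."""
    global_ctx = [
        f"{path}: {sig}"
        for path, src in other_files.items()
        for sig in extract_signatures(src)
    ]
    return "\n".join([
        "# Local context (current file):", local_code.strip(),
        "# Global context (reusable functions in other files):",
        *global_ctx,
        "# Available third-party libraries:", ", ".join(third_party),
        "# Task:", task,
    ])
```

Exposing other files only as signatures, rather than full bodies, keeps the prompt within the LLM's context window while still encouraging reuse of existing functions.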
Related papers
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions [13.58143103712]
GitHub Copilot is a large language model (LLM)-powered code generation tool.
This paper investigates how developers validate and repair code generated by Copilot.
Being aware of the code's provenance led to improved performance, increased search efforts, more frequent Copilot usage, and higher cognitive workload.
arXiv Detail & Related papers (2024-05-25T06:20:01Z) - CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation [60.799992690487336]
We propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks.
CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeCloak: A Method for Evaluating and Mitigating Code Leakage by LLM Code Assistants [23.462703429753706]
We propose two complementary methods to mitigate the risk of code leakage when using LLM-based code assistants.
The first is a technique for reconstructing a developer's original code from code segments sent to the code assistant service.
The second is CodeCloak, a novel deep reinforcement learning agent that manipulates the prompts before sending them to the code assistant service.
arXiv Detail & Related papers (2024-04-13T19:30:58Z) - CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z) - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval
and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
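The iterative retrieve-then-generate loop described for RepoCoder can be sketched as follows, using Jaccard similarity over token sets as a stand-in retriever and a stub in place of the pre-trained code LLM; the names and the stub are illustrative assumptions, not RepoCoder's actual components.

```python
# Minimal sketch of an iterative retrieval-augmented generation loop:
# the draft from each round becomes the retrieval query for the next.

def tokens(code: str) -> set[str]:
    """Crude tokenizer: split on whitespace after stripping parentheses."""
    return set(code.replace("(", " ").replace(")", " ").split())


def retrieve(query: str, repo_snippets: list[str], k: int = 2) -> list[str]:
    """Rank repository snippets by Jaccard similarity to the query."""
    q = tokens(query)
    scored = sorted(
        repo_snippets,
        key=lambda s: len(q & tokens(s)) / max(len(q | tokens(s)), 1),
        reverse=True,
    )
    return scored[:k]


def repo_coder(prompt: str, repo_snippets: list[str], llm, rounds: int = 2) -> str:
    """Retrieve context, generate a draft, then re-retrieve with the draft."""
    query, generated = prompt, ""
    for _ in range(rounds):
        context = "\n".join(retrieve(query, repo_snippets))
        generated = llm(context + "\n" + prompt)
        query = generated  # the draft often matches repo code better
    return generated
```

The key design point is the second round: a generated draft tends to share more surface tokens with the target repository code than the natural-language prompt does, so re-retrieving with it surfaces better context.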
arXiv Detail & Related papers (2023-03-22T13:54:46Z) - Knowledge Transfer for Pseudo-code Generation from Low Resource
Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by a code-to-pseudocode neural model trained on a high-resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C codes through back translation.
arXiv Detail & Related papers (2023-03-16T03:38:08Z) - StructCoder: Structure-Aware Transformer for Code Generation [13.797842927671846]
We introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code.
The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks.
arXiv Detail & Related papers (2022-06-10T17:26:31Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - DeSkew-LSH based Code-to-Code Recommendation Engine [3.7011129410662558]
We present Senatus, a new code-to-code recommendation engine for machine learning on source code.
At the core of Senatus is De-Skew LSH, a new locality-sensitive hashing algorithm that indexes the data for fast (sub-linear time) retrieval.
We show Senatus improves F1 by 6.7% and answers queries 16x faster than Facebook Aroma on the task of code-to-code recommendation.
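The locality-sensitive hashing idea behind Senatus can be illustrated with plain MinHash bucketing: similar snippets tend to share signature bands, so candidate lookup touches only a few buckets instead of the whole corpus. This is an assumption-level sketch; the de-skewing that distinguishes De-Skew LSH is omitted.

```python
# Illustrative MinHash-LSH index for sub-linear code-to-code lookup.
import hashlib


def minhash(tokens: set[str], num_hashes: int = 8) -> tuple[int, ...]:
    """Signature: for each seed, the minimum hash over the token set."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes)
    )


class LSHIndex:
    """Bucket snippets by bands of their MinHash signature."""

    def __init__(self, bands: int = 4):
        self.bands = bands
        self.buckets: dict[tuple, list[str]] = {}

    def _band_keys(self, sig: tuple[int, ...]) -> list[tuple]:
        step = len(sig) // self.bands
        return [(i, sig[i * step:(i + 1) * step]) for i in range(self.bands)]

    def add(self, snippet: str) -> None:
        for key in self._band_keys(minhash(set(snippet.split()))):
            self.buckets.setdefault(key, []).append(snippet)

    def query(self, snippet: str) -> set[str]:
        """Return all snippets sharing at least one band with the query."""
        hits: set[str] = set()
        for key in self._band_keys(minhash(set(snippet.split()))):
            hits.update(self.buckets.get(key, []))
        return hits
```

Banding trades precision for recall: two snippets collide if any one band matches, so near-duplicates are found even when their full signatures differ.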
arXiv Detail & Related papers (2021-11-05T16:56:28Z) - Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.