CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2405.02355v3
- Date: Fri, 08 Nov 2024 14:17:05 GMT
- Title: CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
- Authors: Kounianhua Du, Jizheng Chen, Renting Rui, Huacan Chai, Lingyue Fu, Wei Xia, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang,
- Abstract summary: We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language.
Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert.
- Score: 58.84212778960507
- License:
- Abstract: Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the general large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing among natural language and different programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs. CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language, which can facilitate natural language based LLMs for better understanding of code syntax and serve as a bridge among different programming languages. To take the extracted structural knowledge into the foundation models, we propose 1) a hard meta-graph prompt template to transform the challenging graphical representation into informative knowledge for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into the model parameters via finetuning the models with the help of a pretrained GNN expert model. Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert. CodeGRAG improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation. Code is available at https://anonymous.4open.science/r/Code-5970/.
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages [1.559169421643164]
Node-based programming languages are increasingly popular in media arts coding domains.
Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity.
Best strategy for code generation for visual node-based programming languages is still an open question.
arXiv Detail & Related papers (2024-09-01T22:11:23Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - Improving Natural Language Capability of Code Large Language Model [13.639938216171185]
We propose a novel framework, comprising two modules: AttentionExtractor and AttentionCoder.
AttentionExtractor is responsible for extracting key phrases from the user's natural language requirements, and AttentionCoder leverages these extracted phrases to generate target code.
To validate the effectiveness of the framework, we craft a new code generation benchmark, called MultiNL-H, covering five natural languages.
arXiv Detail & Related papers (2024-01-25T15:33:20Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code)
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for
Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach to intruduce semantic information of code, named SeCoT.
We show that SeCoT can achieves state-of-the-art performance, greatly improving the potential for large models and code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - Language is All a Graph Needs [33.9836278881785]
We propose InstructGLM (Instruction-finetuned Graph Language Model) with highly scalable prompts based on natural language instructions.
Our method surpasses all GNN baselines on ogbn-arxiv, Cora and PubMed datasets.
arXiv Detail & Related papers (2023-08-14T13:41:09Z) - PanGu-Coder: Program Synthesis with Function-Level Language Modeling [47.63943623661298]
PanGu-Coder is a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation.
We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling to pre-train on raw programming language data.
The second stage uses a combination of Causal Language Modelling and Masked Language Modelling to train on loosely curated pairs of natural language program definitions and code functions.
arXiv Detail & Related papers (2022-07-22T18:08:16Z) - Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.