CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2405.02355v4
- Date: Mon, 19 May 2025 05:50:15 GMT
- Title: CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
- Authors: Kounianhua Du, Jizheng Chen, Renting Rui, Huacan Chai, Lingyue Fu, Wei Xia, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang,
- Abstract summary: CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to better interpret the programming domain knowledge.<n>CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
- Score: 58.84212778960507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing between natural language and programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to better interpret the programming domain knowledge, which can facilitate natural language based LLMs for better understanding of code syntax and serve as a bridge among different programming languages. To take the extracted structural knowledge into the foundation models, we propose 1) a hard meta-graph prompt template to transform the challenging syntax graph into informative graphical view for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into model parameters via finetuning the models with the soft signals encoded by GNN expert model. Specifically, two constraints are designed to improve the alignment and structure expressiveness, contributing to the informativeness of the single-token-sized external <GraphEmb> for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/ .
Related papers
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation.<n>For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code.<n>Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z) - Evaluating Programming Language Confusion [6.462594894731934]
Large Language Models for code (Code LLMs) have gained significant traction in software engineering.
These models have demonstrated remarkable capabilities in understanding programming concepts, implementing algorithms, and even bridging different programming languages.
Despite these advances, Code LLMs often struggle with programming language confusion--producing code in unintended languages.
arXiv Detail & Related papers (2025-03-17T18:14:15Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding [20.12647254668254]
Recent code language models have scaled to billions of parameters, but model source code solely as text tokens.<n>We take the best of both worlds with GALLa - Graph Aligned Large Language Models.
arXiv Detail & Related papers (2024-09-06T10:57:34Z) - Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages [1.559169421643164]
Node-based programming languages are increasingly popular in media arts coding domains.
Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity.
Best strategy for code generation for visual node-based programming languages is still an open question.
arXiv Detail & Related papers (2024-09-01T22:11:23Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - Improving Natural Language Capability of Code Large Language Model [13.639938216171185]
We propose a novel framework, comprising two modules: AttentionExtractor and AttentionCoder.
AttentionExtractor is responsible for extracting key phrases from the user's natural language requirements, and AttentionCoder leverages these extracted phrases to generate target code.
To validate the effectiveness of the framework, we craft a new code generation benchmark, called MultiNL-H, covering five natural languages.
arXiv Detail & Related papers (2024-01-25T15:33:20Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code)
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for
Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach to intruduce semantic information of code, named SeCoT.
We show that SeCoT can achieves state-of-the-art performance, greatly improving the potential for large models and code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - Language is All a Graph Needs [33.9836278881785]
We propose InstructGLM (Instruction-finetuned Graph Language Model) with highly scalable prompts based on natural language instructions.
Our method surpasses all GNN baselines on ogbn-arxiv, Cora and PubMed datasets.
arXiv Detail & Related papers (2023-08-14T13:41:09Z) - PanGu-Coder: Program Synthesis with Function-Level Language Modeling [47.63943623661298]
PanGu-Coder is a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation.
We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling to pre-train on raw programming language data.
The second stage uses a combination of Causal Language Modelling and Masked Language Modelling to train on loosely curated pairs of natural language program definitions and code functions.
arXiv Detail & Related papers (2022-07-22T18:08:16Z) - Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.