AI Chain on Large Language Model for Unsupervised Control Flow Graph
Generation for Statically-Typed Partial Code
- URL: http://arxiv.org/abs/2306.00757v1
- Date: Thu, 1 Jun 2023 14:52:59 GMT
- Title: AI Chain on Large Language Model for Unsupervised Control Flow Graph
Generation for Statically-Typed Partial Code
- Authors: Qing Huang, Zhou Zou, Zhenchang Xing, Zhenkang Zuo, Xiwei Xu, Qinghua
Lu
- Abstract summary: Control Flow Graphs (CFGs) are essential for visualizing, understanding and analyzing program behavior.
We propose a novel approach that leverages the error tolerance and understanding ability of pre-trained Large Language Models (LLMs) to generate CFGs.
- Score: 21.423928174875844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Control Flow Graphs (CFGs) are essential for visualizing, understanding and
analyzing program behavior. For statically-typed programming languages like
Java, developers obtain CFGs by using bytecode-based methods for compilable
code and Abstract Syntax Tree (AST)-based methods for partially uncompilable
code. However, explicit syntax errors during AST construction and implicit
semantic errors caused by bad coding practices can lead to behavioral loss and
deviation of CFGs. To address this issue, we propose a novel approach that
leverages the error tolerance and understanding ability of pre-trained Large
Language Models (LLMs) to generate CFGs. Our approach involves a Chain of
Thought (CoT) with four steps: structure hierarchy extraction, nested code
block extraction, CFG generation of nested code blocks, and fusion of all
nested code blocks' CFGs. To address the limitations of the original CoT's
single-prompt approach (i.e., completing all steps in a single generative
pass), which can result in an "epic" prompt with hard-to-control behavior and
error accumulation, we break down the CoT into an AI chain with explicit
sub-steps. Each sub-step corresponds to a separate AI-unit, with an effective
prompt assigned to each unit for interacting with LLMs to accomplish a specific
purpose. Our experiments confirmed that our method outperforms existing CFG
tools in terms of node and edge coverage, especially for incomplete or
erroneous code. We also conducted an ablation experiment and confirmed the
effectiveness of AI chain design principles: Hierarchical Task Breakdown, Unit
Composition, and Mix of AI Units and Non-AI Units. Our work opens up new
possibilities for building foundational software engineering tools based on
LLMs, as opposed to traditional program analysis methods.
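To make the four-step AI chain concrete, here is a minimal Python sketch of the pipeline the abstract describes. The `ask_llm` helper and all prompt wording are hypothetical placeholders, not the authors' actual prompts or units.

```python
# Minimal sketch of the four-step AI chain: each step is a separate AI unit
# with its own focused prompt, instead of one "epic" single-pass prompt.
# `ask_llm` is a hypothetical stand-in for any chat-completion API.

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a pre-trained LLM."""
    raise NotImplementedError("wire this to your LLM provider")

def extract_structure_hierarchy(code: str) -> str:
    return ask_llm("List the structural hierarchy (methods, branches, loops) "
                   "of this code:\n" + code)

def extract_nested_blocks(hierarchy: str, code: str) -> list[str]:
    reply = ask_llm(f"Given this hierarchy:\n{hierarchy}\n"
                    f"Extract each nested code block from:\n{code}")
    return reply.split("\n---\n")  # assumed block separator

def block_to_cfg(block: str) -> str:
    return ask_llm("Produce a CFG (nodes and edges) for this code block:\n" + block)

def fuse_cfgs(cfgs: list[str]) -> str:
    return ask_llm("Fuse these per-block CFGs into one graph:\n" + "\n\n".join(cfgs))

def generate_cfg(partial_code: str) -> str:
    hierarchy = extract_structure_hierarchy(partial_code)
    blocks = extract_nested_blocks(hierarchy, partial_code)
    cfgs = [block_to_cfg(b) for b in blocks]
    return fuse_cfgs(cfgs)
```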
Related papers
- VISUALCODER: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning [10.70881967278009]
We introduce Visual Coder, a simple yet effective approach that enhances code reasoning by integrating multimodal Chain-of-Thought (CoT) reasoning with a visual Control Flow Graph (CFG).
By aligning code snippets with their corresponding CFGs, Visual Coder provides deeper insights into execution flow, enabling more accurate predictions of code behavior.
Our experiments demonstrate that augmenting LLMs with visual CFGs significantly outperforms text-based CFG descriptions in code reasoning tasks.
arXiv Detail & Related papers (2024-10-30T19:07:01Z)
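A hedged sketch of the pairing idea: attach the code snippet and a rendered image of its CFG in one multimodal message. The message schema follows the common OpenAI-style "content parts" format; the rendering step and wording are illustrative, not VISUALCODER's actual setup.

```python
# Build a multimodal CoT prompt that aligns a code snippet with its CFG image.
# The "content parts" message format is the common OpenAI-style schema; the
# prompt text and image path are illustrative placeholders.
import base64

def build_visual_cot_prompt(code: str, cfg_png_path: str) -> list[dict]:
    with open(cfg_png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Reason step by step about the execution flow of this "
                     "code, using the attached control flow graph:\n" + code},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]
```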
- SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook [9.993066868670283]
We introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics via Consistent Semantic Learning.
Our approach constructs a temporospatially consistent semantic codebook, addressing issues of codebook collapse and imbalanced token semantics.
arXiv Detail & Related papers (2024-09-09T23:12:43Z)
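For readers unfamiliar with codebooks, a generic vector-quantization sketch in NumPy shows the mechanism the summary refers to: each encoder feature is snapped to its nearest codebook entry, yielding a discrete token. SGC-VQGAN's semantic clustering and consistency losses are not shown.

```python
# Generic VQ codebook lookup (not SGC-VQGAN's method): nearest-neighbor
# quantization of continuous features into discrete token ids.
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray):
    """features: (N, D), codebook: (K, D) -> (token ids (N,), quantized (N, D))."""
    # Squared L2 distance between every feature and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)       # nearest codebook entry per feature
    return ids, codebook[ids]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))  # K=16 codes, D=8 dims
feats = rng.normal(size=(4, 8))
ids, quantized = quantize(feats, codebook)
print(ids)  # four token ids, one per feature
```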
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on executed code, masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
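A hedged sketch of the FGO masking idea, shown here with a simple token-level loss rather than the paper's RL objective: positions belonging to code that never executed contribute nothing to the gradient. Shapes and the masking policy are illustrative, not StepCoder's exact design.

```python
# Fine-Grained Optimization sketch: zero out the loss on unexecuted tokens.
import torch
import torch.nn.functional as F

def fgo_loss(logits: torch.Tensor, targets: torch.Tensor,
             executed_mask: torch.Tensor) -> torch.Tensor:
    """logits: (T, V), targets: (T,), executed_mask: (T,) with 1 = executed."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    masked = per_token * executed_mask      # unexecuted tokens contribute 0
    return masked.sum() / executed_mask.sum().clamp(min=1)

logits = torch.randn(6, 100)                     # 6 tokens, vocab of 100
targets = torch.randint(0, 100, (6,))
mask = torch.tensor([1, 1, 1, 0, 0, 1]).float()  # tokens 3-4 never executed
print(fgo_loss(logits, targets, mask))
```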
- GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network and a decoding network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z)
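A toy sketch of the decoupled idea: one stage decides the ordering of kept source tokens plus insertion slots, and a second stage fills each slot, all without autoregressive left-to-right generation. The stub functions are placeholders, not GEC-DePenD's architecture.

```python
# Decoupled permutation-and-decoding sketch (toy, not the paper's networks).

def permute(tokens: list[str]) -> list[str]:
    """Stage 1 stub: a permutation network would score orderings of kept
    tokens and insertion slots; here we mark one illustrative slot."""
    return tokens[:1] + ["<slot>"] + tokens[1:]

def fill_slots(layout: list[str]) -> list[str]:
    """Stage 2 stub: a decoder network would predict a token per slot."""
    return [("has" if t == "<slot>" else t) for t in layout]

source = "He gone home".split()
print(" ".join(fill_slots(permute(source))))  # He has gone home
```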
- Coding by Design: GPT-4 empowers Agile Model Driven Development [0.03683202928838613]
This research offers an Agile Model-Driven Development (MDD) approach that enhances code auto-generation using OpenAI's GPT-4.
Our work emphasizes "Agility" as a significant contribution to the current MDD method, particularly when the model undergoes changes or needs deployment in a different programming language.
Ultimately, leveraging GPT-4, our last layer auto-generates code in both Java and Python.
arXiv Detail & Related papers (2023-10-06T15:05:05Z)
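A hedged sketch of that final code-generation layer: hand a platform-independent model (here a toy textual class description) to an LLM and request implementations in two target languages. The `ask_llm` helper, spec notation, and prompt are hypothetical, not the paper's actual modeling pipeline.

```python
# Last-layer MDD sketch: model in, code out, for multiple target languages.

MODEL_SPEC = """
class Order:
  attributes: id int, total float
  operations: add_item(name str, price float), checkout() -> float
"""

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to GPT-4 or another code LLM")

def generate_code(spec: str, language: str) -> str:
    return ask_llm(f"Implement this class model in {language}. "
                   f"Return only code.\n{spec}")

# Usage (once ask_llm is wired up):
#   java_src = generate_code(MODEL_SPEC, "Java")
#   python_src = generate_code(MODEL_SPEC, "Python")
```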
- When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose the complexity-impacted reasoning score (CIRS) to measure the correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
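A hedged sketch of an AST-based complexity signal in the spirit of CIRS: parse the code, then combine structural size and depth with a count of logic-bearing nodes. The weighting is a made-up proxy, not the paper's CIRS formula.

```python
# Illustrative AST complexity score (not the CIRS definition).
import ast

LOGIC_NODES = (ast.If, ast.For, ast.While, ast.BoolOp, ast.Compare)

def depth(node: ast.AST) -> int:
    children = list(ast.iter_child_nodes(node))
    return 1 + max((depth(c) for c in children), default=0)

def complexity_score(code: str) -> float:
    tree = ast.parse(code)
    nodes = list(ast.walk(tree))
    logical = sum(isinstance(n, LOGIC_NODES) for n in nodes)
    # Made-up blend of structural and logical complexity.
    return 0.5 * len(nodes) / depth(tree) + 0.5 * logical

print(complexity_score(
    "for i in range(10):\n    if i % 2 == 0:\n        print(i)"))
```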
- A Chain of AI-based Solutions for Resolving FQNs and Fixing Syntax Errors in Partial Code [20.5627916036]
API documentation, technical blogs and programming Q&A sites contain numerous partial code snippets that can be reused in programming tasks, but these snippets are often uncompilable due to unresolved names and syntax errors.
We propose the Partial Code Reuse Chain (PCR-Chain) for resolving fully-qualified names (FQNs) and fixing last-mile syntax errors in partial code, based on a large language model (LLM) such as ChatGPT.
arXiv Detail & Related papers (2023-06-21T02:13:32Z)
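A hedged sketch of the PCR-Chain idea as two chained LLM units: one resolves simple names to fully-qualified names, the next fixes remaining syntax errors. `ask_llm` and the prompts are placeholders, not the paper's actual chain.

```python
# Two-unit chain sketch for partial-code repair (prompts are illustrative).

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to ChatGPT or a similar LLM")

def resolve_fqns(partial_code: str) -> str:
    return ask_llm("Resolve every simple type name in this Java snippet to "
                   "its fully-qualified name and add the imports:\n"
                   + partial_code)

def fix_syntax(code: str) -> str:
    return ask_llm("Fix any remaining syntax errors, changing nothing else:\n"
                   + code)

def pcr_chain(partial_code: str) -> str:
    return fix_syntax(resolve_fqns(partial_code))
```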
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic context-free grammars (CFGs) with hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
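To see what a synthetic context-free grammar looks like, here is a minimal random generator: nonterminals are expanded via hierarchical rules until only terminals remain. This toy grammar is illustrative, not one of the paper's actual grammars.

```python
# Minimal context-free grammar sampler (toy grammar, not the paper's).
import random

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "noun"], ["det", "adj", "noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}

def generate(symbol: str = "S", rng=random.Random(0)) -> list[str]:
    if symbol not in GRAMMAR:           # terminal symbol: emit as-is
        return [symbol]
    rule = rng.choice(GRAMMAR[symbol])  # pick one production
    return [tok for part in rule for tok in generate(part, rng)]

print(" ".join(generate()))  # e.g. "det noun verb det adj noun"
```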
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
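A hedged usage sketch: loading a small CodeT5+ checkpoint through Hugging Face transformers and generating a completion. The checkpoint id "Salesforce/codet5p-220m" is assumed from the family's public releases; verify it before relying on it.

```python
# Minimal CodeT5+ inference sketch (checkpoint id is an assumption).
from transformers import AutoTokenizer, T5ForConditionalGeneration

name = "Salesforce/codet5p-220m"  # assumed public checkpoint id
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("def print_hello():", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```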
- Precise Learning of Source Code Contextual Semantics via Hierarchical Dependence Structure and Graph Attention Networks [28.212889828892664]
We propose a novel source code model embedded with hierarchical dependencies.
We introduce the syntactic structure of each basic block, i.e., its corresponding AST, into the source code model to provide sufficient information.
The results show that our model reduces the parameter count by 50% and achieves a 4% accuracy improvement on the program classification task.
arXiv Detail & Related papers (2021-11-20T04:03:42Z)
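A generic single-head graph-attention sketch in NumPy showing the mechanism a GAT applies over a code graph: each node (e.g., a basic block) aggregates neighbor features with attention weights. Weights are random and the graph is a toy; this is not the paper's hierarchical model, and the original GAT's LeakyReLU on scores is omitted for brevity.

```python
# One graph-attention head over a 4-node toy graph (generic GAT mechanism).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))            # 4 nodes (e.g., basic blocks), 8 features
A = np.array([[1, 1, 0, 0],            # adjacency including self-loops
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])
W = rng.normal(size=(8, 8))            # shared linear transform
a = rng.normal(size=(16,))             # attention vector over [h_i || h_j]

H = X @ W
scores = np.full((4, 4), -np.inf)      # -inf masks non-neighbors in softmax
for i in range(4):
    for j in range(4):
        if A[i, j]:
            scores[i, j] = np.concatenate([H[i], H[j]]) @ a
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)   # softmax over neighbors
H_out = attn @ H                                # attention-weighted aggregation
print(H_out.shape)  # (4, 8)
```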
- Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because parsed code naturally admits graph structures, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.