TRACED: Execution-aware Pre-training for Source Code
- URL: http://arxiv.org/abs/2306.07487v1
- Date: Tue, 13 Jun 2023 01:30:14 GMT
- Title: TRACED: Execution-aware Pre-training for Source Code
- Authors: Yangruibo Ding, Ben Steenhoek, Kexin Pei, Gail Kaiser, Wei Le,
Baishakhi Ray
- Abstract summary: We introduce TRACED, an execution-aware pre-training strategy for source code.
Our goal is to teach code models the complicated execution logic during the pre-training, enabling the model to statically estimate the dynamic code properties.
TRACED relatively improves the statically pre-trained code models by 12.4% for complete execution path prediction and by 25.2% for runtime variable value predictions.
- Score: 24.101763959136058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing pre-trained language models for source code focus on learning
the static code text, typically augmented with static code structures (abstract
syntax tree, dependency graphs, etc.). However, program semantics will not be
fully exposed before the real execution. Without an understanding of the
program execution, statically pre-trained models fail to comprehensively
capture the dynamic code properties, such as the branch coverage and the
runtime variable values, and they are consequently less effective at code
understanding tasks, such as retrieving semantic clones and detecting software
vulnerabilities.
To close the gap between the static nature of language models and the dynamic
characteristics of programs, we introduce TRACED, an execution-aware
pre-training strategy for source code. Specifically, we pre-train code language
models with a combination of source code, executable inputs, and corresponding
execution traces. Our goal is to teach code models the complicated execution
logic during the pre-training, enabling the model to statically estimate the
dynamic code properties without repeatedly executing code during task-specific
fine-tuning.
To illustrate the effectiveness of our proposed approach, we fine-tune and
evaluate TRACED on three downstream tasks: static execution estimation, clone
retrieval, and vulnerability detection. The empirical results show that TRACED
relatively improves the statically pre-trained code models by 12.4% for
complete execution path prediction and by 25.2% for runtime variable value
predictions. TRACED also significantly outperforms statically pre-trained
models in clone retrieval and vulnerability detection across four public
benchmarks.
Related papers
- VISUALCODER: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning [10.70881967278009]
We introduce Visual Coder, a simple yet effective approach that enhances code reasoning by integrating multimodal Chain-of-Thought (CoT) reasoning with a visual Control Flow Graph (CFG)
By aligning code snippets with their corresponding CFGs, Visual Coder provides deeper insights into execution flow, enabling more accurate predictions of code behavior.
Our experiments demonstrate that augmenting LLMs with visual CFGs significantly outperforms text-based CFG descriptions in code reasoning tasks.
arXiv Detail & Related papers (2024-10-30T19:07:01Z) - Learning to Predict Program Execution by Modeling Dynamic Dependency on Code Graphs [11.347234752942684]
This paper introduces a novel machine learning-based framework called CodeFlow to predict code coverage and detect runtime errors.
CodeFlow represents all possible execution paths and the relationships between different statements.
It learns dynamic dependencies through execution traces, which reflect the impacts among statements during execution.
arXiv Detail & Related papers (2024-08-05T20:32:00Z) - Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language.
We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner.
arXiv Detail & Related papers (2024-02-02T22:19:15Z) - Towards Safe Automated Refactoring of Imperative Deep Learning Programs
to Graph Execution [4.786072763033669]
More natural, less error-prone imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance.
We present our ongoing work on an automated approach that assists developers in specifying whether and how their otherwise imperative DL code could be reliably and efficiently executed as graphs.
The approach is being implemented as a PyDev Eclipse plug-in and uses the WALA Ariadne analysis framework.
arXiv Detail & Related papers (2023-08-22T20:50:19Z) - CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z) - A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z) - Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z) - Towards Efficient Fine-tuning of Pre-trained Code Models: An
Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
arXiv Detail & Related papers (2023-04-11T13:34:13Z) - CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained
Model [23.947178895479464]
We propose CLSEBERT, a Constrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST)
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
arXiv Detail & Related papers (2021-08-10T10:08:21Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.