Code Representation Learning with Prüfer Sequences
- URL: http://arxiv.org/abs/2111.07263v1
- Date: Sun, 14 Nov 2021 07:27:38 GMT
- Title: Code Representation Learning with Prüfer Sequences
- Authors: Tenzin Jinpa and Yong Gao
- Abstract summary: An effective encoding of the source code of a computer program is critical to the success of sequence-to-sequence deep neural network models.
We propose to use the Prüfer sequence of the Abstract Syntax Tree (AST) of a computer program to design a sequential representation scheme.
Our representation makes it possible to develop deep-learning models in which signals carried by lexical tokens in the training examples can be exploited automatically and selectively.
- Score: 2.2463154358632464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An effective and efficient encoding of the source code of a computer program
is critical to the success of sequence-to-sequence deep neural network models
for tasks in computer program comprehension, such as automated code
summarization and documentation. A significant challenge is to find a
sequential representation that captures the structural/syntactic information in
a computer program and facilitates the training of the learning models.
In this paper, we propose to use the Prüfer sequence of the Abstract Syntax
Tree (AST) of a computer program to design a sequential representation scheme
that preserves the structural information in an AST. Our representation makes
it possible to develop deep-learning models in which signals carried by lexical
tokens in the training examples can be exploited automatically and selectively
based on their syntactic role and importance. Unlike other recently-proposed
approaches, our representation is concise and lossless in terms of the
structural information of the AST. Empirical studies on real-world benchmark
datasets, using a sequence-to-sequence learning model we designed for code
summarization, show that our Prüfer-sequence-based representation is indeed
highly effective and efficient, significantly outperforming all of the
recently proposed deep-learning models we used as baselines.
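As a rough illustration of the Prüfer encoding underlying this scheme, the sketch below computes the classical Prüfer sequence of a labeled tree and reconstructs the tree from it; the paper's specific mapping of AST node labels and lexical tokens onto the sequence is not reproduced here, and all names in the code are illustrative.

```python
def prufer_encode(tree):
    """Prüfer sequence of a labeled tree given as {node: set(neighbors)}
    over labels 1..n; returns a list of n-2 node labels."""
    tree = {u: set(vs) for u, vs in tree.items()}
    seq = []
    while len(tree) > 2:
        # Remove the lowest-labeled leaf and record its neighbor.
        leaf = min(u for u, vs in tree.items() if len(vs) == 1)
        neighbor = next(iter(tree[leaf]))
        seq.append(neighbor)
        tree[neighbor].discard(leaf)
        del tree[leaf]
    return seq

def prufer_decode(seq, n):
    """Reconstruct the tree's edge list from a Prüfer sequence over labels 1..n."""
    degree = {u: 1 for u in range(1, n + 1)}
    for s in seq:
        degree[s] += 1
    edges = []
    for s in seq:
        # Attach the lowest-labeled remaining leaf to the next sequence entry.
        leaf = min(u for u in degree if degree[u] == 1)
        edges.append((leaf, s))
        degree[leaf] -= 1
        degree[s] -= 1
    u, v = (x for x in degree if degree[x] == 1)
    edges.append((u, v))
    return edges

# A toy 5-node tree standing in for an AST skeleton:
# node 1 is the root with children 2 and 3; node 3 has children 4 and 5.
tree = {1: {2, 3}, 2: {1}, 3: {1, 4, 5}, 4: {3}, 5: {3}}
seq = prufer_encode(tree)                     # [1, 3, 3]
edges = {tuple(sorted(e)) for e in prufer_decode(seq, 5)}
assert edges == {(1, 2), (1, 3), (3, 4), (3, 5)}
```

Since the sequence uniquely determines the tree, decoding recovers every edge exactly, which is the sense in which a Prüfer-based representation is lossless with respect to the AST's structure.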
Related papers
- The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation [34.37154877681809]
We introduce VeriDistill, the first end-to-end machine learning model that directly processes raw Verilog code to predict circuit quality-of-result metrics.
Our model employs a novel knowledge distillation method, transferring low-level circuit insights via graphs into the LLM-based predictor.
Experiments show VeriDistill outperforms state-of-the-art baselines on large-scale Verilog datasets.
arXiv Detail & Related papers (2024-10-30T04:20:10Z)
- Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? [23.52632194060246]
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering.
The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning.
We compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks.
arXiv Detail & Related papers (2023-12-01T08:37:27Z)
- Sparse Attention-Based Neural Networks for Code Classification [15.296053323327312]
We introduce the Sparse Attention-based neural network for Code Classification (SACC).
In the first step, source code undergoes syntax parsing and preprocessing.
The encoded sequences of subtrees are fed into a Transformer model that incorporates sparse attention mechanisms for the purpose of classification.
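As an illustration of the general idea of sparse attention (a fixed local window here; this is not SACC's actual attention pattern, and all shapes and names are invented for the example):

```python
import numpy as np

def local_sparse_attention(q, k, v, window=4):
    """Scaled dot-product attention where each position may only attend to
    positions at most `window` steps away (a banded sparsity pattern)."""
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)                        # (seq_len, seq_len)
    idx = np.arange(seq_len)
    banned = np.abs(idx[:, None] - idx[None, :]) > window  # outside the band
    scores[banned] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))        # 16 encoded subtree positions, dim 32
out = local_sparse_attention(x, x, x)    # (16, 32)
```

Restricting each position to a local window reduces the quadratic cost of full attention, which is what keeps long encoded subtree sequences tractable.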
arXiv Detail & Related papers (2023-11-11T14:07:12Z)
- Learning ECG signal features without backpropagation [0.0]
We propose a novel method to generate representations for time-series data.
This method relies on ideas from theoretical physics to construct a compact representation in a data-driven way.
We demonstrate the effectiveness of our approach on the task of ECG signal classification, achieving state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T21:35:49Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning in which the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- Adaptive Convolutional Dictionary Network for CT Metal Artifact Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z)
- Representation Learning for Sequence Data with Deep Autoencoding Predictive Components [96.42805872177067]
We propose a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space.
We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between past and future windows at each time step.
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.
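Written out, with notation chosen here for illustration rather than taken from the paper, the objective at each time step is the mutual information between a past and a future window of the latent features:

```latex
% Predictive information of the latent feature sequence (z_t) with window
% length T; notation chosen here for illustration.
I_{\mathrm{pred}}(T) = I\bigl(z_{t-T+1:t};\, z_{t+1:t+T}\bigr)
                     = H\bigl(z_{t+1:t+T}\bigr) - H\bigl(z_{t+1:t+T} \mid z_{t-T+1:t}\bigr)
```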
arXiv Detail & Related papers (2020-10-07T03:34:01Z)
- DRTS Parsing with Structure-Aware Encoding and Decoding [28.711318411470497]
State-of-the-art DRTS parsing performance can be achieved by a neural sequence-to-sequence model.
We propose a structure-aware model at both the encoder and decoder phases to integrate the structural information.
arXiv Detail & Related papers (2020-05-14T12:09:23Z)
- Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits [99.59941892183454]
We propose Einsum Networks (EiNets), a novel implementation design for PCs.
At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum-operation.
We show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation.
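As a toy illustration of the monolithic-einsum idea (this is not the EiNet layer itself; the shapes and names below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented shapes: 8 circuit nodes, batch of 32, child mixtures of size 4 and 5.
p = rng.random((8, 32, 4))   # left-child component values per node and example
q = rng.random((8, 32, 5))   # right-child component values
w = rng.random((8, 4, 5))    # per-node mixture weights over component pairs

# One einsum computes, for every node and example, the weighted sum over all
# products of left and right components.
out = np.einsum('nbi,nbj,nij->nb', p, q, w)

# The equivalent explicit loop, node by node and example by example:
slow = np.array([[(np.outer(p[n, b], q[n, b]) * w[n]).sum()
                  for b in range(32)] for n in range(8)])
assert np.allclose(out, slow)
```

Fusing the per-node loop into a single einsum call lets the underlying linear-algebra library batch the work across nodes and examples, which is where implementations of this kind gain their speed.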
arXiv Detail & Related papers (2020-04-13T23:09:15Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques use the source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
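A minimal single-step graph-convolution sketch over an AST's adjacency structure (our own toy example, not the architecture proposed in that paper):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN-style propagation step: add self-loops, symmetrically normalize
    the adjacency, aggregate neighbor features, apply a linear map and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

# Tiny AST: root 0 with children 1 and 2; node 2 has child 3.
adj = np.zeros((4, 4))
for u, v in [(0, 1), (0, 2), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))           # node embeddings (e.g. of node types)
weight = rng.standard_normal((8, 8))
node_states = gcn_layer(adj, feats, weight)   # (4, 8), structure-aware node states
```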
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.