Code Representation Learning with Prüfer Sequences
- URL: http://arxiv.org/abs/2111.07263v1
- Date: Sun, 14 Nov 2021 07:27:38 GMT
- Title: Code Representation Learning with Prüfer Sequences
- Authors: Tenzin Jinpa and Yong Gao
- Abstract summary: An effective encoding of the source code of a computer program is critical to the success of sequence-to-sequence deep neural network models.
We propose to use the Prüfer sequence of the Abstract Syntax Tree (AST) of a computer program to design a sequential representation scheme.
Our representation makes it possible to develop deep-learning models in which signals carried by lexical tokens in the training examples can be exploited automatically and selectively.
- Score: 2.2463154358632464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An effective and efficient encoding of the source code of a computer program
is critical to the success of sequence-to-sequence deep neural network models
for tasks in computer program comprehension, such as automated code
summarization and documentation. A significant challenge is to find a
sequential representation that captures the structural/syntactic information in
a computer program and facilitates the training of the learning models.
In this paper, we propose to use the Prüfer sequence of the Abstract Syntax
Tree (AST) of a computer program to design a sequential representation scheme
that preserves the structural information in an AST. Our representation makes
it possible to develop deep-learning models in which signals carried by lexical
tokens in the training examples can be exploited automatically and selectively
based on their syntactic role and importance. Unlike other recently-proposed
approaches, our representation is concise and lossless in terms of the
structural information of the AST. Empirical studies on real-world benchmark
datasets, using a sequence-to-sequence learning model we designed for code
summarization, show that our Prüfer-sequence-based representation is indeed
highly effective and efficient, significantly outperforming all of the
recently proposed deep-learning models we used as baselines.
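As a rough illustration of the Prüfer encoding underlying this scheme, the sketch below computes the classical Prüfer sequence of a labeled tree and reconstructs the tree from it; the paper's specific mapping of AST node labels and lexical tokens onto the sequence is not reproduced here, and all names in the code are illustrative.

```python
def prufer_encode(tree):
    """Prüfer sequence of a labeled tree given as {node: set(neighbors)}
    over labels 1..n; returns a list of n-2 node labels."""
    tree = {u: set(vs) for u, vs in tree.items()}
    seq = []
    while len(tree) > 2:
        # Remove the lowest-labeled leaf and record its neighbor.
        leaf = min(u for u, vs in tree.items() if len(vs) == 1)
        neighbor = next(iter(tree[leaf]))
        seq.append(neighbor)
        tree[neighbor].discard(leaf)
        del tree[leaf]
    return seq

def prufer_decode(seq, n):
    """Reconstruct the tree's edge list from a Prüfer sequence over labels 1..n."""
    degree = {u: 1 for u in range(1, n + 1)}
    for s in seq:
        degree[s] += 1
    edges = []
    for s in seq:
        # Attach the lowest-labeled remaining leaf to the next sequence entry.
        leaf = min(u for u in degree if degree[u] == 1)
        edges.append((leaf, s))
        degree[leaf] -= 1
        degree[s] -= 1
    u, v = (x for x in degree if degree[x] == 1)
    edges.append((u, v))
    return edges

# A toy 5-node tree standing in for an AST skeleton:
# node 1 is the root with children 2 and 3; node 3 has children 4 and 5.
tree = {1: {2, 3}, 2: {1}, 3: {1, 4, 5}, 4: {3}, 5: {3}}
seq = prufer_encode(tree)                     # [1, 3, 3]
edges = {tuple(sorted(e)) for e in prufer_decode(seq, 5)}
assert edges == {(1, 2), (1, 3), (3, 4), (3, 5)}
```

Since the sequence uniquely determines the tree, decoding recovers every edge exactly, which is the sense in which a Prüfer-based representation is lossless with respect to the AST's structure.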
Related papers
- The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation [34.37154877681809]
We introduce VeriDistill, the first end-to-end machine learning model that directly processes raw Verilog code to predict circuit quality-of-result metrics.
Our model employs a novel knowledge distillation method, transferring low-level circuit insights via graphs into the LLM-based predictor.
Experiments show VeriDistill outperforms state-of-the-art baselines on large-scale Verilog datasets.
arXiv Detail & Related papers (2024-10-30T04:20:10Z)
- Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? [23.52632194060246]
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering.
The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning.
We compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks.
arXiv Detail & Related papers (2023-12-01T08:37:27Z)
- Sparse Attention-Based Neural Networks for Code Classification [15.296053323327312]
We introduce the Sparse Attention-based neural network for Code Classification (SACC).
In the first step, source code undergoes syntax parsing and preprocessing.
The encoded sequences of subtrees are fed into a Transformer model that incorporates sparse attention mechanisms for the purpose of classification.
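As an illustration of the general idea of sparse attention (a fixed local window here; this is not SACC's actual attention pattern, and all shapes and names are invented for the example):

```python
import numpy as np

def local_sparse_attention(q, k, v, window=4):
    """Scaled dot-product attention where each position may only attend to
    positions at most `window` steps away (a banded sparsity pattern)."""
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)                        # (seq_len, seq_len)
    idx = np.arange(seq_len)
    banned = np.abs(idx[:, None] - idx[None, :]) > window  # outside the band
    scores[banned] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))        # 16 encoded subtree positions, dim 32
out = local_sparse_attention(x, x, x)    # (16, 32)
```

Restricting each position to a local window reduces the quadratic cost of full attention, which is what keeps long encoded subtree sequences tractable.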
arXiv Detail & Related papers (2023-11-11T14:07:12Z)
- Learning ECG signal features without backpropagation [0.0]
We propose a novel method to generate representations for time-series data.
This method relies on ideas from theoretical physics to construct a compact representation in a data-driven way.
We demonstrate the effectiveness of our approach on the task of ECG signal classification, achieving state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T21:35:49Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning in which the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- Adaptive Convolutional Dictionary Network for CT Metal Artifact Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z)
- Representation Learning for Sequence Data with Deep Autoencoding Predictive Components [96.42805872177067]
We propose a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space.
We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between past and future windows at each time step.
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.
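Written out, with notation chosen here for illustration rather than taken from the paper, the objective at each time step is the mutual information between a past and a future window of the latent features:

```latex
% Predictive information of the latent feature sequence (z_t) with window
% length T; notation chosen here for illustration.
I_{\mathrm{pred}}(T) = I\bigl(z_{t-T+1:t};\, z_{t+1:t+T}\bigr)
                     = H\bigl(z_{t+1:t+T}\bigr) - H\bigl(z_{t+1:t+T} \mid z_{t-T+1:t}\bigr)
```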
arXiv Detail & Related papers (2020-10-07T03:34:01Z)
- DRTS Parsing with Structure-Aware Encoding and Decoding [28.711318411470497]
State-of-the-art DRTS parsing performance can be achieved by a neural sequence-to-sequence model.
We propose a structure-aware model at both the encoder and decoder phases to integrate the structural information.
arXiv Detail & Related papers (2020-05-14T12:09:23Z)
- Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits [99.59941892183454]
We propose Einsum Networks (EiNets), a novel implementation design for PCs.
At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum-operation.
We show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation.
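As a toy illustration of the monolithic-einsum idea (this is not the EiNet layer itself; the shapes and names below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented shapes: 8 circuit nodes, batch of 32, child mixtures of size 4 and 5.
p = rng.random((8, 32, 4))   # left-child component values per node and example
q = rng.random((8, 32, 5))   # right-child component values
w = rng.random((8, 4, 5))    # per-node mixture weights over component pairs

# One einsum computes, for every node and example, the weighted sum over all
# products of left and right components.
out = np.einsum('nbi,nbj,nij->nb', p, q, w)

# The equivalent explicit loop, node by node and example by example:
slow = np.array([[(np.outer(p[n, b], q[n, b]) * w[n]).sum()
                  for b in range(32)] for n in range(8)])
assert np.allclose(out, slow)
```

Fusing the per-node loop into a single einsum call lets the underlying linear-algebra library batch the work across nodes and examples, which is where implementations of this kind gain their speed.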
arXiv Detail & Related papers (2020-04-13T23:09:15Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques use the source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
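A minimal single-step graph-convolution sketch over an AST's adjacency structure (our own toy example, not the architecture proposed in that paper):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN-style propagation step: add self-loops, symmetrically normalize
    the adjacency, aggregate neighbor features, apply a linear map and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

# Tiny AST: root 0 with children 1 and 2; node 2 has child 3.
adj = np.zeros((4, 4))
for u, v in [(0, 1), (0, 2), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))           # node embeddings (e.g. of node types)
weight = rng.standard_normal((8, 8))
node_states = gcn_layer(adj, feats, weight)   # (4, 8), structure-aware node states
```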
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.