Towards Demystifying Dimensions of Source Code Embeddings
- URL: http://arxiv.org/abs/2008.13064v3
- Date: Tue, 29 Sep 2020 00:19:28 GMT
- Title: Towards Demystifying Dimensions of Source Code Embeddings
- Authors: Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour
- Abstract summary: We present our preliminary results towards better understanding the contents of code2vec neural source code embeddings.
Our results suggest that the handcrafted features can perform very close to the high-dimensional code2vec embeddings.
We also find that the code2vec embeddings are more resilient than the handcrafted features to the removal of dimensions with low information gain.
- Score: 5.211235558099913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Source code representations are key in applying machine learning techniques
for processing and analyzing programs. A popular approach to representing
source code is neural source code embeddings, which represent programs with
high-dimensional vectors computed by training deep neural networks on a large
volume of programs. Although successful, little is known about the contents of
these vectors and their characteristics. In this paper, we present our
preliminary results towards better understanding the contents of code2vec
neural source code embeddings. In particular, in a small case study, we use the
code2vec embeddings to create binary SVM classifiers and compare their
performance with that of classifiers built on handcrafted features. Our results
suggest that the handcrafted features can perform very close to the
high-dimensional code2vec embeddings, and that information gain is more evenly
distributed across the code2vec dimensions than across the handcrafted
features. We also find that the code2vec embeddings are more resilient than the
handcrafted features to the removal of dimensions with low information gain. We
hope our results serve as a stepping stone toward principled analysis and
evaluation of these code representations.
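For concreteness, the study design can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' code: it trains a binary SVM on placeholder embedding vectors (384-dimensional, matching the code2vec default), uses scikit-learn's mutual_info_classif as a stand-in for per-dimension information gain, and retrains after dropping the lower-gain half of the dimensions.

```python
# A minimal sketch of the study design, assuming a hypothetical dataset:
# placeholder code2vec-style embeddings with binary labels. mutual_info_classif
# is used here as an approximation of per-dimension information gain.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))    # placeholder 384-dim embeddings
y = rng.integers(0, 2, size=500)   # placeholder binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: binary SVM on all embedding dimensions.
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("all dims:", clf.score(X_te, y_te))

# Rank dimensions by (approximate) information gain, drop the lowest half,
# and retrain to gauge resilience to removing low-gain dimensions.
ig = mutual_info_classif(X_tr, y_tr, random_state=0)
keep = np.argsort(ig)[len(ig) // 2:]
clf_top = SVC(kernel="linear").fit(X_tr[:, keep], y_tr)
print("top-IG dims:", clf_top.score(X_te[:, keep], y_te))
```

The same two steps run on a handcrafted feature matrix in place of X would reproduce the paper's comparison axis.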
Related papers
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent and can thereby improve the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- Linear Codes for Hyperdimensional Computing [9.7902367664742]
We show that random linear codes offer a rich subcode structure that can be used to form key-value stores.
We show that under the framework we develop, random linear codes admit simple recovery algorithms to factor (either bundled or bound) compositional representations.
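The key-value idea can be illustrated with a generic hyperdimensional-computing sketch: random ±1 hypervectors with multiplicative binding and additive bundling. This is an assumption for illustration only; it does not reproduce the paper's random-linear-code construction.

```python
# A generic hyperdimensional key-value sketch (hypothetical; the paper uses
# random *linear* codes, which this does not reproduce).
import numpy as np

D = 10_000
rng = np.random.default_rng(0)
hv = lambda: rng.choice([-1, 1], size=D)   # random bipolar hypervector

keys = {name: hv() for name in ["color", "shape"]}
vals = {name: hv() for name in ["red", "circle"]}

# Bind each key to its value (elementwise multiply), bundle by summation.
memory = keys["color"] * vals["red"] + keys["shape"] * vals["circle"]

# Recover the value bound to "color": unbind, then match by dot-product similarity.
query = memory * keys["color"]
best = max(vals, key=lambda name: query @ vals[name])
print(best)  # -> "red" with high probability for large D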
arXiv Detail & Related papers (2024-03-05T19:18:44Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach enriches the representation of source code, thereby improving task performance.
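As a rough illustration of combining structural features with additional context, the hypothetical sketch below counts AST node types for a Python snippet (via the standard ast module) and appends bag-of-words tokens from a bug report; the paper's ASTNN-based pipeline is considerably richer than this.

```python
# A minimal, hypothetical sketch of augmenting an AST-based representation
# with extra context (bug-report text); not the paper's ASTNN pipeline.
import ast
from collections import Counter

def ast_features(source: str) -> Counter:
    """Count AST node types as a crude structural representation."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def augmented_features(source: str, bug_report: str) -> Counter:
    """Concatenate structural counts with bag-of-words context tokens."""
    feats = ast_features(source)
    feats.update(f"ctx:{tok}" for tok in bug_report.lower().split())
    return feats

print(augmented_features("def f(x):\n    return x + 1", "off by one error"))
```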
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both the randomness of masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
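The second stage can be illustrated with a minimal InfoNCE-style loss over hard positives and negatives. The shapes and sampling below are hypothetical; this is a generic contrastive-learning sketch, not the paper's implementation.

```python
# A minimal InfoNCE-style contrastive loss with hard negatives
# (hypothetical shapes; not the paper's implementation).
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, hard_negatives, tau=0.07):
    """anchor, positive: (B, d); hard_negatives: (B, K, d)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(hard_negatives, dim=-1)
    pos = (a * p).sum(-1, keepdim=True)            # (B, 1) similarity to positive
    neg = torch.einsum("bd,bkd->bk", a, n)         # (B, K) similarity to negatives
    logits = torch.cat([pos, neg], dim=1) / tau    # positive sits at index 0
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))

B, K, d = 8, 4, 128
loss = contrastive_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, K, d))
print(loss.item())
```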
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
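A minimal version of such a pipeline, assuming a generic bert-base-uncased checkpoint and a tiny made-up labeled set, could mean-pool BERT's hidden states into comment embeddings and fit a linear classifier on top:

```python
# A minimal, hypothetical sketch of the described pipeline; not the authors' code.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(comments):
    """Mean-pool the last hidden states into one vector per comment."""
    batch = tok(comments, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch).last_hidden_state        # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

comments = ["returns the user id", "TODO", "handles null input safely", "asdf"]
labels = [1, 0, 1, 0]                                # 1 = "Useful" (made-up labels)
clf = LogisticRegression().fit(embed(comments), labels)
print(clf.predict(embed(["checks bounds before writing"])))
```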
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- A Neural Network Architecture for Program Understanding Inspired by Human Behaviors [10.745648153049965]
We present PGNN, a partitioning-based graph neural network model that operates on an upgraded AST of code.
We transform raw code with external knowledge and apply pre-training techniques for information extraction.
We conduct extensive experiments to show the superior performance of PGNN-EK on the code summarization and code clone detection tasks.
arXiv Detail & Related papers (2022-05-10T06:53:45Z)
- GypSum: Learning Hybrid Representations for Code Summarization [21.701127410434914]
GypSum is a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model.
We modify the encoder-decoder sublayer in the Transformer's decoder to fuse the representations and propose a dual-copy mechanism to facilitate summary generation.
arXiv Detail & Related papers (2022-04-26T07:44:49Z)
- LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes [55.32790803903619]
We propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes.
Our method does not require any side-information, like annotated attributes or label meta-data.
We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes.
arXiv Detail & Related papers (2021-06-02T21:57:52Z)
- Learning to map source code to software vulnerability using code-as-a-graph [67.62847721118142]
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
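One message-passing step over a code-as-graph encoding can be sketched as follows; the toy adjacency matrix and random features are assumptions for illustration, not the paper's model.

```python
# A minimal sketch of one graph-convolution step over a code graph
# (hypothetical toy graph; not the paper's architecture).
import numpy as np

# Toy graph: 4 AST/CFG nodes with edges (0-1, 1-2, 1-3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
A_hat = A + np.eye(4)                               # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(1))                 # degree normalization
X = np.random.default_rng(0).normal(size=(4, 8))    # node features
W = np.random.default_rng(1).normal(size=(8, 8))    # layer weights

H = np.maximum(D_inv @ A_hat @ X @ W, 0)            # H = ReLU(D^-1 (A+I) X W)
print(H.shape)  # one message-passing step; stack layers for a full GNN
```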
arXiv Detail & Related papers (2020-06-15T16:05:27Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph that is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
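For intuition, a generic auto-encoder hashing baseline can be sketched as below: train for reconstruction, then binarize the bottleneck by sign. This is an assumed simplification; the twin-bottleneck design with its code-driven graph is considerably more involved.

```python
# A generic auto-encoder hashing sketch (hypothetical baseline, not the
# twin-bottleneck model): learn a bottleneck, then take its sign as the code.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(128, 32), nn.Tanh())   # continuous bottleneck
dec = nn.Linear(32, 128)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

x = torch.randn(64, 128)                  # placeholder input features
for _ in range(100):
    z = enc(x)
    loss = ((dec(z) - x) ** 2).mean()     # reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()

codes = (enc(x).detach() > 0).int()       # binarize: sign of the bottleneck
print(codes.shape)                        # (64, 32) binary hash codes
```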
arXiv Detail & Related papers (2020-02-27T05:58:12Z)