Towards Demystifying Dimensions of Source Code Embeddings
- URL: http://arxiv.org/abs/2008.13064v3
- Date: Tue, 29 Sep 2020 00:19:28 GMT
- Title: Towards Demystifying Dimensions of Source Code Embeddings
- Authors: Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour
- Abstract summary: We present our preliminary results towards better understanding the contents of code2vec neural source code embeddings.
Our results suggest that the handcrafted features can perform very close to the high-dimensional code2vec embeddings.
We also find that the code2vec embeddings are more resilient than the handcrafted features to the removal of dimensions with low information gain.
- Score: 5.211235558099913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Source code representations are key in applying machine learning techniques
for processing and analyzing programs. A popular approach to representing
source code is neural source code embeddings, which represent programs with
high-dimensional vectors computed by training deep neural networks on a large
volume of programs. Although successful, little is known about the contents of
these vectors and their characteristics. In this paper, we present our
preliminary results towards better understanding the contents of code2vec
neural source code embeddings. In particular, in a small case study, we use the
code2vec embeddings to create binary SVM classifiers and compare their
performance with that of classifiers built on handcrafted features. Our results
suggest that the handcrafted features can perform very close to the
high-dimensional code2vec embeddings, and that information gain is more evenly
distributed across the code2vec dimensions than across the handcrafted
features. We also find that the code2vec embeddings are more resilient than the
handcrafted features to the removal of dimensions with low information gain. We
hope our results serve as a stepping stone toward principled analysis and
evaluation of these code representations.
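For concreteness, the study design can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' code: it trains a binary SVM on placeholder embedding vectors (384-dimensional, matching the code2vec default), uses scikit-learn's mutual_info_classif as a stand-in for per-dimension information gain, and retrains after dropping the lower-gain half of the dimensions.

```python
# A minimal sketch of the study design, assuming a hypothetical dataset:
# placeholder code2vec-style embeddings with binary labels. mutual_info_classif
# is used here as an approximation of per-dimension information gain.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))    # placeholder 384-dim embeddings
y = rng.integers(0, 2, size=500)   # placeholder binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: binary SVM on all embedding dimensions.
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("all dims:", clf.score(X_te, y_te))

# Rank dimensions by (approximate) information gain, drop the lowest half,
# and retrain to gauge resilience to removing low-gain dimensions.
ig = mutual_info_classif(X_tr, y_tr, random_state=0)
keep = np.argsort(ig)[len(ig) // 2:]
clf_top = SVC(kernel="linear").fit(X_tr[:, keep], y_tr)
print("top-IG dims:", clf_top.score(X_te[:, keep], y_te))
```

The same two steps run on a handcrafted feature matrix in place of X would reproduce the paper's comparison axis.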
Related papers
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent and can thereby improve the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- Linear Codes for Hyperdimensional Computing [9.7902367664742]
We show that random linear codes offer a rich subcode structure that can be used to form key-value stores.
We show that under the framework we develop, random linear codes admit simple recovery algorithms to factor (either bundled or bound) compositional representations.
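The key-value idea can be illustrated with a generic hyperdimensional-computing sketch: random ±1 hypervectors with multiplicative binding and additive bundling. This is an assumption for illustration only; it does not reproduce the paper's random-linear-code construction.

```python
# A generic hyperdimensional key-value sketch (hypothetical; the paper uses
# random *linear* codes, which this does not reproduce).
import numpy as np

D = 10_000
rng = np.random.default_rng(0)
hv = lambda: rng.choice([-1, 1], size=D)   # random bipolar hypervector

keys = {name: hv() for name in ["color", "shape"]}
vals = {name: hv() for name in ["red", "circle"]}

# Bind each key to its value (elementwise multiply), bundle by summation.
memory = keys["color"] * vals["red"] + keys["shape"] * vals["circle"]

# Recover the value bound to "color": unbind, then match by dot-product similarity.
query = memory * keys["color"]
best = max(vals, key=lambda name: query @ vals[name])
print(best)  # -> "red" with high probability for large D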
arXiv Detail & Related papers (2024-03-05T19:18:44Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach enriches the representation of source code, thereby improving task performance.
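As a rough illustration of combining structural features with additional context, the hypothetical sketch below counts AST node types for a Python snippet (via the standard ast module) and appends bag-of-words tokens from a bug report; the paper's ASTNN-based pipeline is considerably richer than this.

```python
# A minimal, hypothetical sketch of augmenting an AST-based representation
# with extra context (bug-report text); not the paper's ASTNN pipeline.
import ast
from collections import Counter

def ast_features(source: str) -> Counter:
    """Count AST node types as a crude structural representation."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def augmented_features(source: str, bug_report: str) -> Counter:
    """Concatenate structural counts with bag-of-words context tokens."""
    feats = ast_features(source)
    feats.update(f"ctx:{tok}" for tok in bug_report.lower().split())
    return feats

print(augmented_features("def f(x):\n    return x + 1", "off by one error"))
```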
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both the randomness of masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
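The second stage can be illustrated with a minimal InfoNCE-style loss over hard positives and negatives. The shapes and sampling below are hypothetical; this is a generic contrastive-learning sketch, not the paper's implementation.

```python
# A minimal InfoNCE-style contrastive loss with hard negatives
# (hypothetical shapes; not the paper's implementation).
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, hard_negatives, tau=0.07):
    """anchor, positive: (B, d); hard_negatives: (B, K, d)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(hard_negatives, dim=-1)
    pos = (a * p).sum(-1, keepdim=True)            # (B, 1) similarity to positive
    neg = torch.einsum("bd,bkd->bk", a, n)         # (B, K) similarity to negatives
    logits = torch.cat([pos, neg], dim=1) / tau    # positive sits at index 0
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))

B, K, d = 8, 4, 128
loss = contrastive_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, K, d))
print(loss.item())
```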
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
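A minimal version of such a pipeline, assuming a generic bert-base-uncased checkpoint and a tiny made-up labeled set, could mean-pool BERT's hidden states into comment embeddings and fit a linear classifier on top:

```python
# A minimal, hypothetical sketch of the described pipeline; not the authors' code.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(comments):
    """Mean-pool the last hidden states into one vector per comment."""
    batch = tok(comments, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch).last_hidden_state        # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

comments = ["returns the user id", "TODO", "handles null input safely", "asdf"]
labels = [1, 0, 1, 0]                                # 1 = "Useful" (made-up labels)
clf = LogisticRegression().fit(embed(comments), labels)
print(clf.predict(embed(["checks bounds before writing"])))
```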
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- A Neural Network Architecture for Program Understanding Inspired by Human Behaviors [10.745648153049965]
We present PGNN, a partitioning-based graph neural network model that operates on an upgraded AST of code.
We transform raw code with external knowledge and apply pre-training techniques for information extraction.
We conduct extensive experiments to show the superior performance of PGNN-EK on the code summarization and code clone detection tasks.
arXiv Detail & Related papers (2022-05-10T06:53:45Z)
- GypSum: Learning Hybrid Representations for Code Summarization [21.701127410434914]
GypSum is a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model.
We modify the encoder-decoder sublayer in the Transformer's decoder to fuse the representations and propose a dual-copy mechanism to facilitate summary generation.
arXiv Detail & Related papers (2022-04-26T07:44:49Z)
- LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes [55.32790803903619]
We propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes.
Our method does not require any side-information, like annotated attributes or label meta-data.
We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes.
arXiv Detail & Related papers (2021-06-02T21:57:52Z)
- Learning to map source code to software vulnerability using code-as-a-graph [67.62847721118142]
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
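One message-passing step over a code-as-graph encoding can be sketched as follows; the toy adjacency matrix and random features are assumptions for illustration, not the paper's model.

```python
# A minimal sketch of one graph-convolution step over a code graph
# (hypothetical toy graph; not the paper's architecture).
import numpy as np

# Toy graph: 4 AST/CFG nodes with edges (0-1, 1-2, 1-3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
A_hat = A + np.eye(4)                               # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(1))                 # degree normalization
X = np.random.default_rng(0).normal(size=(4, 8))    # node features
W = np.random.default_rng(1).normal(size=(8, 8))    # layer weights

H = np.maximum(D_inv @ A_hat @ X @ W, 0)            # H = ReLU(D^-1 (A+I) X W)
print(H.shape)  # one message-passing step; stack layers for a full GNN
```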
arXiv Detail & Related papers (2020-06-15T16:05:27Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph that is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
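For intuition, a generic auto-encoder hashing baseline can be sketched as below: train for reconstruction, then binarize the bottleneck by sign. This is an assumed simplification; the twin-bottleneck design with its code-driven graph is considerably more involved.

```python
# A generic auto-encoder hashing sketch (hypothetical baseline, not the
# twin-bottleneck model): learn a bottleneck, then take its sign as the code.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(128, 32), nn.Tanh())   # continuous bottleneck
dec = nn.Linear(32, 128)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

x = torch.randn(64, 128)                  # placeholder input features
for _ in range(100):
    z = enc(x)
    loss = ((dec(z) - x) ** 2).mean()     # reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()

codes = (enc(x).detach() > 0).int()       # binarize: sign of the bottleneck
print(codes.shape)                        # (64, 32) binary hash codes
```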
arXiv Detail & Related papers (2020-02-27T05:58:12Z)