Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
- URL: http://arxiv.org/abs/2004.02942v1
- Date: Mon, 6 Apr 2020 19:05:18 GMT
- Title: Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
- Authors: Rhys Compton, Eibe Frank, Panos Patros, Abigail Koay
- Abstract summary: This paper investigates the effect of obfuscating variable names during the training of a code2vec model to force it to rely on the structure of the code rather than specific names.
Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics.
- Score: 0.716879432974126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic source code analysis in key areas of software engineering, such as
code security, can benefit from Machine Learning (ML). However, many standard
ML approaches require a numeric representation of data and cannot be applied
directly to source code. Thus, to enable ML, we need to embed source code into
numeric feature vectors while maintaining the semantics of the code as much as
possible. code2vec is a recently released embedding approach that uses the
proxy task of method name prediction to map Java methods to feature vectors.
However, experimentation with code2vec shows that it learns to rely on variable
names for prediction, causing it to be easily fooled by typos or adversarial
attacks. Moreover, it is only able to embed individual Java methods and cannot
embed an entire collection of methods such as those present in a typical Java
class, making it difficult to perform predictions at the class level (e.g., for
the identification of malicious Java classes). Both shortcomings are addressed
in the research presented in this paper. We investigate the effect of
obfuscating variable names during the training of a code2vec model to force it
to rely on the structure of the code rather than specific names and consider a
simple approach to creating class-level embeddings by aggregating sets of
method embeddings. Our results, obtained on a challenging new collection of
source-code classification problems, indicate that obfuscating variable names
produces an embedding model that is both impervious to variable naming and more
accurately reflects code semantics. The datasets, models, and code are shared
for further ML research on source code.
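To make the obfuscation step concrete, here is a minimal sketch of variable renaming, assuming a crude regex pass rather than the AST-based renaming a real preprocessing pipeline would use; the VAR<n> placeholder scheme and the `obfuscate_method` helper are illustrative stand-ins, not the paper's actual code.

```python
import re

# Java keywords and common names that must never be renamed (abridged).
JAVA_KEYWORDS = {
    "abstract", "boolean", "break", "byte", "case", "catch", "char", "class",
    "continue", "default", "do", "double", "else", "enum", "extends", "final",
    "float", "for", "if", "implements", "import", "instanceof", "int",
    "interface", "long", "new", "null", "package", "private", "protected",
    "public", "return", "short", "static", "super", "switch", "this",
    "throw", "throws", "try", "void", "while", "true", "false",
    "String", "System",
}

def obfuscate_method(source: str) -> str:
    """Replace each distinct identifier with a generic VAR<n> token.

    Every occurrence of the same identifier maps to the same placeholder,
    so the structure of the code is preserved while the names themselves
    carry no information. A real implementation would rename only local
    variables and parameters, working on the AST rather than with a regex.
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in JAVA_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"VAR{len(mapping)}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", rename, source)

print(obfuscate_method("int total = price * count; return total;"))
# -> int VAR0 = VAR1 * VAR2; return VAR0;
```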
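The class-level step can be sketched just as briefly. Here `embed_method` is a hypothetical wrapper around a trained code2vec model that maps one Java method to a fixed-size vector; mean-pooling is shown as one plausible aggregator in the spirit of the "simple approach" the abstract describes, not necessarily the exact one used.

```python
import numpy as np

def embed_class(method_sources, embed_method):
    """Aggregate per-method code2vec vectors into one class-level vector.

    method_sources: list of Java method bodies belonging to one class.
    embed_method:   assumed callable mapping a method body to a 1-D vector.
    """
    vectors = np.stack([embed_method(src) for src in method_sources])
    return vectors.mean(axis=0)  # shape: (embedding_dim,)
```

Because a class is an unordered set of methods, any permutation-invariant pooling (mean, max, sum) fits this role.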
Related papers
- STRIDE: Simple Type Recognition In Decompiled Executables [16.767295743254458]
We propose STRIDE, a technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data.
We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming.
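As a toy illustration of the token-matching idea (not STRIDE's actual algorithm), the sketch below indexes the k tokens around each known variable slot in training data and predicts the most frequent name seen in an identical context; all names and data shapes here are assumptions.

```python
from collections import defaultdict, Counter

def build_index(training_samples, k=3):
    """training_samples: iterable of (tokens, pos, name) triples, where
    tokens is a decompiled token list, pos is the index of a variable
    slot, and name is the original variable name at that slot."""
    index = defaultdict(Counter)
    for tokens, pos, name in training_samples:
        context = tuple(tokens[max(0, pos - k):pos] + tokens[pos + 1:pos + 1 + k])
        index[context][name] += 1
    return index

def predict_name(index, tokens, pos, k=3):
    """Return the most frequent name observed in the same context, if any."""
    context = tuple(tokens[max(0, pos - k):pos] + tokens[pos + 1:pos + 1 + k])
    seen = index.get(context)
    return seen.most_common(1)[0][0] if seen else None
```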
arXiv Detail & Related papers (2024-07-03T01:09:41Z)
- Variables are a Curse in Software Vulnerability Prediction [4.453430599945387]
We introduce a new type of edge called name dependence, an abstract syntax graph based on name dependence, and an efficient node representation method named the 3-property encoding scheme.
These techniques allow concrete variable names to be removed from code and help deep learning models learn the functionality of software hidden in diverse code expressions.
arXiv Detail & Related papers (2024-06-18T16:02:29Z)
- Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery [50.564146730579424]
We propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples.
Our method unlocks the multi-modal potential of CLIP and outperforms baseline methods by a large margin on all GCD benchmarks.
arXiv Detail & Related papers (2024-03-15T02:40:13Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Multi-Label Knowledge Distillation [86.03990467785312]
We propose a novel multi-label knowledge distillation method.
On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems.
On the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings.
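A rough sketch of the logit-splitting idea, one binary problem per label; the temperature and exact loss form are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def one_vs_rest_distillation(student_logits, teacher_logits, T=2.0):
    """Distill multi-label logits label by label.

    Both tensors have shape (batch, num_labels). Each label is treated as
    its own binary classification problem: the teacher's softened
    per-label probability becomes a soft target for the student.
    """
    soft_targets = torch.sigmoid(teacher_logits / T)
    return F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)
```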
arXiv Detail & Related papers (2023-08-12T03:19:08Z)
- Multi-Instance Partial-Label Learning: Towards Exploiting Dual Inexact Supervision [53.530957567507365]
In some real-world tasks, each training sample is associated with a candidate label set that contains one ground-truth label and some false positive labels.
In this paper, we formalize such problems as multi-instance partial-label learning (MIPL).
Existing multi-instance learning algorithms and partial-label learning algorithms are suboptimal for solving MIPL problems.
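To make the MIPL setting concrete, here is a minimal sketch of the training unit it implies: a bag of instances paired with a candidate label set in which exactly one label is correct; the names are illustrative.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MIPLBag:
    instances: np.ndarray   # shape (num_instances, feature_dim)
    candidate_labels: set   # one true label plus false positives

bag = MIPLBag(instances=np.random.randn(5, 16), candidate_labels={2, 7, 9})
```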
arXiv Detail & Related papers (2022-12-18T03:28:51Z)
- CLAWSAT: Towards Both Robust and Accurate Code Models [74.57590254102311]
We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models.
To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models.
arXiv Detail & Related papers (2022-11-21T18:32:50Z)
- VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning [84.70916463298109]
VarCLR is a new approach for learning semantic representations of variable names.
VarCLR is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs.
We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT.
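A minimal sketch of the contrastive objective described here: embeddings of two variable names known to be similar (say, maxIndex and max_idx) form a positive pair, and the other names in the batch act as negatives; the encoder is a placeholder, not VarCLR's actual model.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, tau=0.07):
    """InfoNCE loss: row i of `anchor` should match row i of `positive`.

    Both tensors have shape (batch, dim); every other row in the batch
    serves as a negative for a given anchor.
    """
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.T / tau             # pairwise cosine similarities
    targets = torch.arange(a.size(0))  # diagonal entries are the positives
    return F.cross_entropy(logits, targets)
```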
arXiv Detail & Related papers (2021-12-05T18:40:32Z)
- BERT2Code: Can Pretrained Language Models be Leveraged for Code Search? [0.7953229555481884]
We show that our model learns the inherent relationship between the embedding spaces and further probe the scope for improvement.
In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance.
arXiv Detail & Related papers (2021-04-16T10:28:27Z)
- Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling [17.377157455292817]
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine.
We propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling and Byte-Pair Encoding.
We show that our trained VarBERT model can predict variable names identical to the ones present in the original source code up to 84.15% of the time.
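As a toy stand-in for this approach (not the actual VarBERT model or its constrained decoding), any pretrained masked language model can be asked to fill a masked variable slot; bert-base-uncased is used here purely for illustration.

```python
from transformers import pipeline

# A generic masked language model; real variable-name recovery would use
# a model trained on decompiled code with an appropriate tokenizer.
fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("int [MASK] = items.size();"):
    print(candidate["token_str"], round(candidate["score"], 3))
```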
arXiv Detail & Related papers (2021-03-23T19:09:22Z)
- Exploiting Method Names to Improve Code Summarization: A Deliberation Multi-Task Learning Approach [5.577102440028882]
We design a novel multi-task learning (MTL) approach for code summarization.
We first introduce the tasks of generation and informativeness prediction of method names.
A novel two-pass deliberation mechanism is then incorporated into our MTL architecture to generate more consistent intermediate states.
arXiv Detail & Related papers (2021-03-21T17:52:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.