Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
- URL: http://arxiv.org/abs/2004.02942v1
- Date: Mon, 6 Apr 2020 19:05:18 GMT
- Title: Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
- Authors: Rhys Compton, Eibe Frank, Panos Patros, Abigail Koay
- Abstract summary: This paper investigates the effect of obfuscating variable names during the training of a code2vec model to force it to rely on the structure of the code rather than specific names.
Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics.
- Score: 0.716879432974126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic source code analysis in key areas of software engineering, such as
code security, can benefit from Machine Learning (ML). However, many standard
ML approaches require a numeric representation of data and cannot be applied
directly to source code. Thus, to enable ML, we need to embed source code into
numeric feature vectors while maintaining the semantics of the code as much as
possible. code2vec is a recently released embedding approach that uses the
proxy task of method name prediction to map Java methods to feature vectors.
However, experimentation with code2vec shows that it learns to rely on variable
names for prediction, causing it to be easily fooled by typos or adversarial
attacks. Moreover, it is only able to embed individual Java methods and cannot
embed an entire collection of methods such as those present in a typical Java
class, making it difficult to perform predictions at the class level (e.g., for
the identification of malicious Java classes). Both shortcomings are addressed
in the research presented in this paper. We investigate the effect of
obfuscating variable names during the training of a code2vec model to force it
to rely on the structure of the code rather than specific names and consider a
simple approach to creating class-level embeddings by aggregating sets of
method embeddings. Our results, obtained on a challenging new collection of
source-code classification problems, indicate that obfuscating variable names
produces an embedding model that is both impervious to variable naming and more
accurately reflects code semantics. The datasets, models, and code are shared
for further ML research on source code.
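To make the obfuscation step concrete, here is a minimal sketch of variable renaming, assuming a crude regex pass rather than the AST-based renaming a real preprocessing pipeline would use; the VAR<n> placeholder scheme and the `obfuscate_method` helper are illustrative stand-ins, not the paper's actual code.

```python
import re

# Java keywords and common names that must never be renamed (abridged).
JAVA_KEYWORDS = {
    "abstract", "boolean", "break", "byte", "case", "catch", "char", "class",
    "continue", "default", "do", "double", "else", "enum", "extends", "final",
    "float", "for", "if", "implements", "import", "instanceof", "int",
    "interface", "long", "new", "null", "package", "private", "protected",
    "public", "return", "short", "static", "super", "switch", "this",
    "throw", "throws", "try", "void", "while", "true", "false",
    "String", "System",
}

def obfuscate_method(source: str) -> str:
    """Replace each distinct identifier with a generic VAR<n> token.

    Every occurrence of the same identifier maps to the same placeholder,
    so the structure of the code is preserved while the names themselves
    carry no information. A real implementation would rename only local
    variables and parameters, working on the AST rather than with a regex.
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in JAVA_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"VAR{len(mapping)}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", rename, source)

print(obfuscate_method("int total = price * count; return total;"))
# -> int VAR0 = VAR1 * VAR2; return VAR0;
```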
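The class-level step can be sketched just as briefly. Here `embed_method` is a hypothetical wrapper around a trained code2vec model that maps one Java method to a fixed-size vector; mean-pooling is shown as one plausible aggregator in the spirit of the "simple approach" the abstract describes, not necessarily the exact one used.

```python
import numpy as np

def embed_class(method_sources, embed_method):
    """Aggregate per-method code2vec vectors into one class-level vector.

    method_sources: list of Java method bodies belonging to one class.
    embed_method:   assumed callable mapping a method body to a 1-D vector.
    """
    vectors = np.stack([embed_method(src) for src in method_sources])
    return vectors.mean(axis=0)  # shape: (embedding_dim,)
```

Because a class is an unordered set of methods, any permutation-invariant pooling (mean, max, sum) fits this role.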
Related papers
- STRIDE: Simple Type Recognition In Decompiled Executables [16.767295743254458]
We propose STRIDE, a technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data.
We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming.
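As a toy illustration of the token-matching idea (not STRIDE's actual algorithm), the sketch below indexes the k tokens around each known variable slot in training data and predicts the most frequent name seen in an identical context; all names and data shapes here are assumptions.

```python
from collections import defaultdict, Counter

def build_index(training_samples, k=3):
    """training_samples: iterable of (tokens, pos, name) triples, where
    tokens is a decompiled token list, pos is the index of a variable
    slot, and name is the original variable name at that slot."""
    index = defaultdict(Counter)
    for tokens, pos, name in training_samples:
        context = tuple(tokens[max(0, pos - k):pos] + tokens[pos + 1:pos + 1 + k])
        index[context][name] += 1
    return index

def predict_name(index, tokens, pos, k=3):
    """Return the most frequent name observed in the same context, if any."""
    context = tuple(tokens[max(0, pos - k):pos] + tokens[pos + 1:pos + 1 + k])
    seen = index.get(context)
    return seen.most_common(1)[0][0] if seen else None
```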
arXiv Detail & Related papers (2024-07-03T01:09:41Z)
- Variables are a Curse in Software Vulnerability Prediction [4.453430599945387]
We introduce a new type of edge called name dependence, an abstract syntax graph based on name dependence, and an efficient node representation method named the 3-property encoding scheme.
These techniques allow concrete variable names to be removed from code and help deep learning models learn the functionality of software hidden in diverse code expressions.
arXiv Detail & Related papers (2024-06-18T16:02:29Z)
- Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery [50.564146730579424]
We propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples.
Our method unlocks the multi-modal potential of CLIP and outperforms baseline methods by a large margin on all GCD benchmarks.
arXiv Detail & Related papers (2024-03-15T02:40:13Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Multi-Label Knowledge Distillation [86.03990467785312]
We propose a novel multi-label knowledge distillation method.
On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems.
On the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings.
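A rough sketch of the logit-splitting idea, one binary problem per label; the temperature and exact loss form are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def one_vs_rest_distillation(student_logits, teacher_logits, T=2.0):
    """Distill multi-label logits label by label.

    Both tensors have shape (batch, num_labels). Each label is treated as
    its own binary classification problem: the teacher's softened
    per-label probability becomes a soft target for the student.
    """
    soft_targets = torch.sigmoid(teacher_logits / T)
    return F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)
```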
arXiv Detail & Related papers (2023-08-12T03:19:08Z)
- Multi-Instance Partial-Label Learning: Towards Exploiting Dual Inexact Supervision [53.530957567507365]
In some real-world tasks, each training sample is associated with a candidate label set that contains one ground-truth label and some false positive labels.
In this paper, we formalize such problems as multi-instance partial-label learning (MIPL).
Existing multi-instance learning algorithms and partial-label learning algorithms are suboptimal for solving MIPL problems.
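To make the MIPL setting concrete, here is a minimal sketch of the training unit it implies: a bag of instances paired with a candidate label set in which exactly one label is correct; the names are illustrative.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MIPLBag:
    instances: np.ndarray   # shape (num_instances, feature_dim)
    candidate_labels: set   # one true label plus false positives

bag = MIPLBag(instances=np.random.randn(5, 16), candidate_labels={2, 7, 9})
```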
arXiv Detail & Related papers (2022-12-18T03:28:51Z)
- CLAWSAT: Towards Both Robust and Accurate Code Models [74.57590254102311]
We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models.
To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models.
arXiv Detail & Related papers (2022-11-21T18:32:50Z)
- VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning [84.70916463298109]
VarCLR is a new approach for learning semantic representations of variable names.
VarCLR is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs.
We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT.
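A minimal sketch of the contrastive objective described here: embeddings of two variable names known to be similar (say, maxIndex and max_idx) form a positive pair, and the other names in the batch act as negatives; the encoder is a placeholder, not VarCLR's actual model.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, tau=0.07):
    """InfoNCE loss: row i of `anchor` should match row i of `positive`.

    Both tensors have shape (batch, dim); every other row in the batch
    serves as a negative for a given anchor.
    """
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.T / tau             # pairwise cosine similarities
    targets = torch.arange(a.size(0))  # diagonal entries are the positives
    return F.cross_entropy(logits, targets)
```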
arXiv Detail & Related papers (2021-12-05T18:40:32Z)
- BERT2Code: Can Pretrained Language Models be Leveraged for Code Search? [0.7953229555481884]
We show that our model learns the inherent relationship between the embedding spaces and further probe the scope for improvement.
In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance.
arXiv Detail & Related papers (2021-04-16T10:28:27Z)
- Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling [17.377157455292817]
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine.
We propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling and Byte-Pair Encoding.
We show that our trained VarBERT model can predict variable names identical to the ones present in the original source code up to 84.15% of the time.
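As a toy stand-in for this approach (not the actual VarBERT model or its constrained decoding), any pretrained masked language model can be asked to fill a masked variable slot; bert-base-uncased is used here purely for illustration.

```python
from transformers import pipeline

# A generic masked language model; real variable-name recovery would use
# a model trained on decompiled code with an appropriate tokenizer.
fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("int [MASK] = items.size();"):
    print(candidate["token_str"], round(candidate["score"], 3))
```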
arXiv Detail & Related papers (2021-03-23T19:09:22Z)
- Exploiting Method Names to Improve Code Summarization: A Deliberation Multi-Task Learning Approach [5.577102440028882]
We design a novel multi-task learning (MTL) approach for code summarization.
We first introduce the tasks of generation and informativeness prediction of method names.
A novel two-pass deliberation mechanism is then incorporated into our MTL architecture to generate more consistent intermediate states.
arXiv Detail & Related papers (2021-03-21T17:52:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.