Variable Name Recovery in Decompiled Binary Code using Constrained
Masked Language Modeling
- URL: http://arxiv.org/abs/2103.12801v1
- Date: Tue, 23 Mar 2021 19:09:22 GMT
- Title: Variable Name Recovery in Decompiled Binary Code using Constrained
Masked Language Modeling
- Authors: Pratyay Banerjee, Kuntal Kumar Pal, Fish Wang, Chitta Baral
- Abstract summary: Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine.
We propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling and Byte-Pair Encoding.
We show that our trained VarBERT model can predict variable names identical to the ones present in the original source code up to 84.15% of the time.
- Score: 17.377157455292817
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Decompilation is the procedure of transforming binary programs into a
high-level representation, such as source code, for human analysts to examine.
While modern decompilers can reconstruct and recover much information that is
discarded during compilation, inferring variable names is still extremely
difficult. Inspired by recent advances in natural language processing, we
propose a novel solution to infer variable names in decompiled code based on
Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as
Transformers and BERT. Our solution takes raw decompiler output, the less
semantically meaningful code, as input, and enriches it using our proposed
finetuning technique, Constrained Masked Language Modeling. Using Constrained
Masked Language Modeling introduces the challenge of predicting the number of
masked tokens for the original variable name. We address this "count of token
prediction" challenge with our post-processing algorithm. Compared to the
state-of-the-art approaches, our trained VarBERT model is simpler and performs
much better. We evaluated our model on an existing large-scale data set with
164,632 binaries and showed that it can predict variable names identical to
the ones present in the original source code up to 84.15% of the time.
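As a rough, hypothetical illustration of the masked-language-modeling idea and the count-of-token search described above (not the authors' VarBERT model, training data, or post-processing algorithm), the sketch below asks an off-the-shelf BERT masked LM to fill a masked variable name in a decompiled-looking snippet, trying several mask counts and keeping the highest-scoring guess. The model name, snippet, and scoring heuristic are assumptions for demonstration only.

```python
# Illustrative sketch only: a generic BERT masked LM standing in for VarBERT.
# The snippet, the mask-count enumeration, and the scoring are assumptions.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

# Decompiler-style output with a placeholder where the variable name was lost.
SNIPPET = "int {v} = 0 ; while ( {v} < n ) {{ total += buf [ {v} ] ; {v} ++ ; }}"

def recover_name(max_subtokens: int = 4) -> str:
    """Try 1..max_subtokens [MASK] slots and keep the highest-scoring guess."""
    best_score, best_name = float("-inf"), None
    for k in range(1, max_subtokens + 1):
        masks = " ".join([tokenizer.mask_token] * k)
        inputs = tokenizer(SNIPPET.format(v=masks), return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0]
        # Positions of every mask slot (the placeholder occurs four times).
        mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        log_probs = torch.log_softmax(logits[mask_pos], dim=-1)
        top = log_probs.max(dim=-1)              # greedy fill per slot
        score = top.values.mean().item()         # mean log-prob as a crude score across k
        name = tokenizer.decode(top.indices[:k]).replace(" ", "")
        if score > best_score:
            best_score, best_name = score, name
    return best_name

print(recover_name())  # e.g. "i" or "index", depending on the model
```

Enumerating mask counts and re-scoring is only a stand-in for the paper's dedicated post-processing algorithm that predicts the number of masked tokens for the original variable name.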
Related papers
- STRIDE: Simple Type Recognition In Decompiled Executables [16.767295743254458]
We propose STRIDE, a technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data.
We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming.
arXiv Detail & Related papers (2024-07-03T01:09:41Z)
- Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation [8.225408779913712]
Referring image segmentation aims to segment an object referred to by natural language expression from an image.
Conventional transformer decoders can distort linguistic information with deeper layers, leading to suboptimal results.
We introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder.
arXiv Detail & Related papers (2024-04-12T07:13:32Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Revisiting Deep Learning for Variable Type Recovery [3.075963833361584]
DIRTY is a Transformer-based-Decoder architecture capable of augmenting decompiled code with variable names and types.
We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler.
arXiv Detail & Related papers (2023-04-07T22:28:28Z)
- Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
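As a small, hypothetical illustration of the unigram-LM-versus-BPE comparison summarized in the entry above (not that paper's actual experimental setup), the sketch below trains both tokenizer types on the same corpus with the sentencepiece library and prints how each segments the same identifier-like string; corpus.txt, the vocabulary size, and the probe string are assumptions.

```python
# Hypothetical comparison of BPE vs. unigram LM tokenization with sentencepiece.
import sentencepiece as spm

# Train two small tokenizers on the same corpus, differing only in model_type.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",               # assumed plain-text training corpus
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

# Compare how each model segments the same probe string.
for model_type in ("bpe", "unigram"):
    sp = spm.SentencePieceProcessor(model_file=f"tok_{model_type}.model")
    print(model_type, sp.encode("counter_index", out_type=str))
```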
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.