Variable Name Recovery in Decompiled Binary Code using Constrained
Masked Language Modeling
- URL: http://arxiv.org/abs/2103.12801v1
- Date: Tue, 23 Mar 2021 19:09:22 GMT
- Title: Variable Name Recovery in Decompiled Binary Code using Constrained
Masked Language Modeling
- Authors: Pratyay Banerjee, Kuntal Kumar Pal, Fish Wang, Chitta Baral
- Abstract summary: Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine.
We propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as Transformers and BERT.
We show that our trained VarBERT model can predict variable names identical to the ones present in the original source code up to 84.15% of the time.
- Score: 17.377157455292817
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Decompilation is the procedure of transforming binary programs into a
high-level representation, such as source code, for human analysts to examine.
While modern decompilers can reconstruct and recover much information that is
discarded during compilation, inferring variable names is still extremely
difficult. Inspired by recent advances in natural language processing, we
propose a novel solution to infer variable names in decompiled code based on
Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as
Transformers and BERT. Our solution takes raw decompiler output, the
less semantically meaningful code, as input, and enriches it using our proposed
finetuning technique, Constrained Masked Language Modeling. Using
Constrained Masked Language Modeling introduces the challenge of predicting the
number of masked tokens for the original variable name. We address this
"count of token prediction" challenge with our post-processing
algorithm. Compared to the state-of-the-art approaches, our trained VarBERT
model is simpler and performs much better. We evaluated our model on an
existing large-scale data set with 164,632 binaries and showed that it can
predict variable names identical to the ones present in the original source
code up to 84.15% of the time.
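To make the setup concrete, the sketch below masks a single variable-name slot in a snippet of decompiler-style output, fills it for several candidate mask counts, and keeps the best-scoring fill, which is the flavor of post-processing the "count of token prediction" challenge calls for. It is a minimal illustration under stated assumptions: a generic pretrained bert-base-uncased model stands in for VarBERT, the code snippet and the VAR placeholder are invented for the example, and the mean-log-probability scoring rule is an assumed heuristic rather than the paper's actual post-processing algorithm.
```python
# Minimal sketch of masked variable-name prediction with a mask-count search.
# Assumptions: a generic pretrained BERT stands in for the paper's VarBERT,
# and the scoring rule below is an illustrative heuristic, not the paper's
# actual post-processing algorithm.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Decompiler-style output with one variable-name slot to recover.
CODE = "int VAR = strlen ( path ) ;"

def fill_slot(num_masks: int):
    """Replace the slot with `num_masks` [MASK] tokens, fill each mask
    greedily, and return (predicted subword pieces, mean log-probability)."""
    masked = CODE.replace("VAR", " ".join([tokenizer.mask_token] * num_masks))
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    log_probs = torch.log_softmax(logits[mask_positions], dim=-1)
    best = log_probs.max(dim=-1)  # greedy choice per masked position
    pieces = tokenizer.convert_ids_to_tokens(best.indices.tolist())
    return pieces, best.values.mean().item()

# The number of subword tokens in the original name is unknown, so try a few
# mask counts and keep the best-scoring fill -- a stand-in for post-processing
# the "count of token prediction" challenge.
candidates = {n: fill_slot(n) for n in range(1, 4)}
best_n = max(candidates, key=lambda n: candidates[n][1])
pieces, _ = candidates[best_n]
# Join WordPiece pieces (stripping "##" continuation markers) into one name.
print("".join(p[2:] if p.startswith("##") else p for p in pieces))
```
Joining the predicted WordPiece pieces yields a single candidate identifier, which can then be compared against the name used in the original source code.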
Related papers
- ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages.
This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
- Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process.
We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing.
We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z)
- From Language Models over Tokens to Language Models over Characters [54.123846188068384]
Modern language models are internally, and mathematically, distributions over token strings rather than character strings.
This paper presents algorithms for converting token-level language models to character-level ones.
arXiv Detail & Related papers (2024-12-04T21:19:20Z)
- STRIDE: Simple Type Recognition In Decompiled Executables [16.767295743254458]
We propose STRIDE, a technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data.
We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming.
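As a rough illustration of that matching idea (only the general flavor; the token window, training pairs, and voting rule below are invented for the example and are not STRIDE's actual algorithm or data format), a variable's name can be guessed by comparing the decompiler tokens around it with previously seen (context, name) pairs:
```python
# Toy illustration of name prediction by matching decompiler-token contexts
# against previously seen (context, name) pairs. The data and voting rule are
# hypothetical and only sketch the general matching idea.
from collections import Counter

# Hypothetical "training" pairs: a window of decompiler tokens around a
# variable slot, and the developer-chosen name observed in source code.
TRAIN = [
    (("int", "<VAR>", "=", "strlen", "("), "length"),
    (("int", "<VAR>", "=", "open", "("), "fd"),
    (("char", "*", "<VAR>", "=", "malloc"), "buffer"),
    (("int", "<VAR>", "=", "strlen", "("), "len"),
]

def match_score(a, b):
    """Count positions at which two token windows agree."""
    return sum(x == y for x, y in zip(a, b))

def predict_name(context, k=3):
    """Vote among the k training windows that best match the query context."""
    ranked = sorted(TRAIN, key=lambda pair: match_score(context, pair[0]), reverse=True)
    votes = Counter(name for _, name in ranked[:k])
    return votes.most_common(1)[0][0]

query = ("int", "<VAR>", "=", "strlen", "(")
print(predict_name(query))  # prints the name voted for by the closest contexts
```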
arXiv Detail & Related papers (2024-07-03T01:09:41Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM and then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Revisiting Deep Learning for Variable Type Recovery [3.075963833361584]
DIRTY is a Transformer-based encoder-decoder architecture capable of augmenting decompiled code with variable names and types.
We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler.
arXiv Detail & Related papers (2023-04-07T22:28:28Z)
- Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)