Revisiting Deep Learning for Variable Type Recovery
- URL: http://arxiv.org/abs/2304.03854v1
- Date: Fri, 7 Apr 2023 22:28:28 GMT
- Title: Revisiting Deep Learning for Variable Type Recovery
- Authors: Kevin Cao, Kevin Leach
- Abstract summary: DIRTY is a Transformer-based Encoder-Decoder architecture capable of augmenting decompiled code with variable names and types.
We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler.
- Score: 3.075963833361584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compiled binary executables are often the only available artifact in reverse
engineering, malware analysis, and software systems maintenance. Unfortunately,
the lack of semantic information like variable types makes comprehending
binaries difficult. In efforts to improve the comprehensibility of binaries,
researchers have recently used machine learning techniques to predict semantic
information contained in the original source code. Chen et al. implemented
DIRTY, a Transformer-based Encoder-Decoder architecture capable of augmenting
decompiled code with variable names and types by leveraging decompiler output
tokens and variable size information. Chen et al. were able to demonstrate a
substantial increase in name and type extraction accuracy on Hex-Rays
decompiler outputs compared to existing static analysis and AI-based
techniques. We extend the original DIRTY results by re-training the DIRTY model
on a dataset produced by the open-source Ghidra decompiler. Although Chen et
al. concluded that Ghidra was not a suitable decompiler candidate due to its
difficulty in parsing and incorporating DWARF symbols during analysis, we
demonstrate that straightforward parsing of variable data generated by Ghidra
results in similar retyping performance. We hope this work inspires further
interest and adoption of the Ghidra decompiler for use in research projects.
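The abstract credits straightforward parsing of Ghidra's variable data for the comparable retyping results. As a rough illustration of that kind of extraction (a sketch only, not the authors' actual dataset pipeline), a minimal Ghidra Jython script could dump each decompiled function's local variables together with the names, types, and sizes Ghidra assigns; `currentProgram` is supplied by Ghidra's Script Manager.

```python
# Sketch only: enumerate decompiled functions and print their local variables
# with the name, data type, and size Ghidra assigns. Run from Ghidra's Script
# Manager (Jython), where `currentProgram` is predefined.
from ghidra.app.decompiler import DecompInterface
from ghidra.util.task import ConsoleTaskMonitor

decomp = DecompInterface()
decomp.openProgram(currentProgram)
monitor = ConsoleTaskMonitor()

for func in currentProgram.getFunctionManager().getFunctions(True):
    results = decomp.decompileFunction(func, 60, monitor)
    if not results.decompileCompleted():
        continue
    high = results.getHighFunction()
    pseudo_c = results.getDecompiledFunction().getC()  # decompiler output text a model could tokenize
    for sym in high.getLocalSymbolMap().getSymbols():
        # One candidate training record: (function, variable name, type, size).
        print("%s\t%s\t%s\t%d" % (func.getName(),
                                  sym.getName(),
                                  sym.getDataType().getName(),
                                  sym.getSize()))
```

Pairing each function's decompiled text with these (name, type, size) records would yield training examples in roughly the shape a retyping model such as DIRTY consumes.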
Related papers
- STRIDE: Simple Type Recognition In Decompiled Executables [16.767295743254458]
We propose STRIDE, a technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data.
We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming.
arXiv Detail & Related papers (2024-07-03T01:09:41Z)
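As a toy sketch of the token-sequence matching idea described in the STRIDE summary above (simplified n-gram voting over made-up examples, not the paper's actual algorithm):

```python
# Index n-grams of decompiler tokens that surround a variable slot in training
# data, then predict the (name, type) whose n-grams best match a query context.
from collections import Counter, defaultdict

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy training data: (tokens around a variable slot "VAR", ground-truth name/type).
train = [
    (["int", "VAR", "=", "strlen", "(", "s", ")"], ("len", "size_t")),
    (["VAR", "=", "malloc", "(", "n", ")"], ("buf", "char *")),
]

index = defaultdict(Counter)
for tokens, label in train:
    for g in ngrams(tokens):
        index[g][label] += 1

def predict(tokens):
    votes = Counter()
    for g in ngrams(tokens):
        votes.update(index.get(g, Counter()))
    return votes.most_common(1)[0][0] if votes else None

print(predict(["int", "VAR", "=", "strlen", "(", "path", ")"]))
# -> ('len', 'size_t') under these toy examples
```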
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score.
FoC-Sim outperforms the previous best methods with a 52% higher Recall@1.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Refining Decompiled C Code with Large Language Models [15.76430362775126]
A C decompiler converts an executable into source code.
The recovered C source code, once re-compiled, is expected to produce an executable with the same functionality as the original executable.
arXiv Detail & Related papers (2023-10-10T11:22:30Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries [4.0484792045035505]
We extend large pre-trained language models of source code to summarise decompiled binary functions.
We investigate the impact of input and data properties on the performance of such models.
BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively.
arXiv Detail & Related papers (2023-01-04T16:56:33Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling [17.377157455292817]
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine.
We propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling and Byte-Pair Encoding.
We show that our trained VarBERT model can predict variable names identical to the ones present in the original source code up to 84.15% of the time.
arXiv Detail & Related papers (2021-03-23T19:09:22Z)
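The VarBERT entry above describes constrained masked language modeling for variable names. A minimal illustration of that idea, using a generic `bert-base-uncased` checkpoint from Hugging Face `transformers` as a stand-in for the paper's model and a small hand-picked candidate vocabulary:

```python
# Illustrative only: predict a masked variable name, restricting the masked-LM
# prediction to a candidate set of names (all single tokens in the BERT vocab;
# multi-token names would need subword handling).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Decompiled snippet with the variable to rename replaced by the mask token.
code = "int [MASK] = recv(sock, buf, sizeof(buf), 0); if ([MASK] < 0) return -1;"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Constrain predictions to a candidate vocabulary of plausible variable names.
candidates = ["length", "count", "status", "size", "result", "value"]
candidate_ids = tokenizer.convert_tokens_to_ids(candidates)

mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    scores = logits[0, pos, candidate_ids]
    print("predicted name:", candidates[int(scores.argmax())])
```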
- Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
arXiv Detail & Related papers (2021-01-19T11:45:46Z)
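As a toy sketch of the supervised setup in the entry above (hypothetical features and labels, not the paper's feature set), a classifier can be trained to map per-function features extracted from the binary to a high-level return type:

```python
# Toy sketch of supervised return-type inference; feature columns are
# hypothetical placeholders, e.g. (width in bytes of the register holding the
# return value, result dereferenced as a pointer?, result compared against 0?).
from sklearn.ensemble import RandomForestClassifier

X = [
    [8, 1, 0],
    [4, 0, 1],
    [8, 0, 1],
    [1, 0, 1],
]
y = ["char *", "int", "long", "bool"]  # high-level return types (labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

print(clf.predict([[4, 0, 1]]))  # -> predicted high-level return type, e.g. 'int'
```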
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations, and the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.