Revisiting Deep Learning for Variable Type Recovery
- URL: http://arxiv.org/abs/2304.03854v1
- Date: Fri, 7 Apr 2023 22:28:28 GMT
- Title: Revisiting Deep Learning for Variable Type Recovery
- Authors: Kevin Cao, Kevin Leach
- Abstract summary: DIRTY is a Transformer-based Encoder-Decoder architecture capable of augmenting decompiled code with variable names and types.
We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler.
- Score: 3.075963833361584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compiled binary executables are often the only available artifact in reverse
engineering, malware analysis, and software systems maintenance. Unfortunately,
the lack of semantic information like variable types makes comprehending
binaries difficult. In efforts to improve the comprehensibility of binaries,
researchers have recently used machine learning techniques to predict semantic
information contained in the original source code. Chen et al. implemented
DIRTY, a Transformer-based Encoder-Decoder architecture capable of augmenting
decompiled code with variable names and types by leveraging decompiler output
tokens and variable size information. Chen et al. were able to demonstrate a
substantial increase in name and type extraction accuracy on Hex-Rays
decompiler outputs compared to existing static analysis and AI-based
techniques. We extend the original DIRTY results by re-training the DIRTY model
on a dataset produced by the open-source Ghidra decompiler. Although Chen et
al. concluded that Ghidra was not a suitable decompiler candidate due to its
difficulty in parsing and incorporating DWARF symbols during analysis, we
demonstrate that straightforward parsing of variable data generated by Ghidra
results in similar retyping performance. We hope this work inspires further
interest and adoption of the Ghidra decompiler for use in research projects.
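The abstract credits straightforward parsing of Ghidra's variable data for the comparable retyping results. As a rough illustration of that kind of extraction (a sketch only, not the authors' actual dataset pipeline), a minimal Ghidra Jython script could dump each decompiled function's local variables together with the names, types, and sizes Ghidra assigns; `currentProgram` is supplied by Ghidra's Script Manager.

```python
# Sketch only: enumerate decompiled functions and print their local variables
# with the name, data type, and size Ghidra assigns. Run from Ghidra's Script
# Manager (Jython), where `currentProgram` is predefined.
from ghidra.app.decompiler import DecompInterface
from ghidra.util.task import ConsoleTaskMonitor

decomp = DecompInterface()
decomp.openProgram(currentProgram)
monitor = ConsoleTaskMonitor()

for func in currentProgram.getFunctionManager().getFunctions(True):
    results = decomp.decompileFunction(func, 60, monitor)
    if not results.decompileCompleted():
        continue
    high = results.getHighFunction()
    pseudo_c = results.getDecompiledFunction().getC()  # decompiler output text a model could tokenize
    for sym in high.getLocalSymbolMap().getSymbols():
        # One candidate training record: (function, variable name, type, size).
        print("%s\t%s\t%s\t%d" % (func.getName(),
                                  sym.getName(),
                                  sym.getDataType().getName(),
                                  sym.getSize()))
```

Pairing each function's decompiled text with these (name, type, size) records would yield training examples in roughly the shape a retyping model such as DIRTY consumes.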
Related papers
- STRIDE: Simple Type Recognition In Decompiled Executables [16.767295743254458]
We propose STRIDE, a technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data.
We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming.
arXiv Detail & Related papers (2024-07-03T01:09:41Z)
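As a toy sketch of the token-sequence matching idea described in the STRIDE summary above (simplified n-gram voting over made-up examples, not the paper's actual algorithm):

```python
# Index n-grams of decompiler tokens that surround a variable slot in training
# data, then predict the (name, type) whose n-grams best match a query context.
from collections import Counter, defaultdict

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy training data: (tokens around a variable slot "VAR", ground-truth name/type).
train = [
    (["int", "VAR", "=", "strlen", "(", "s", ")"], ("len", "size_t")),
    (["VAR", "=", "malloc", "(", "n", ")"], ("buf", "char *")),
]

index = defaultdict(Counter)
for tokens, label in train:
    for g in ngrams(tokens):
        index[g][label] += 1

def predict(tokens):
    votes = Counter()
    for g in ngrams(tokens):
        votes.update(index.get(g, Counter()))
    return votes.most_common(1)[0][0] if votes else None

print(predict(["int", "VAR", "=", "strlen", "(", "path", ")"]))
# -> ('len', 'size_t') under these toy examples
```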
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score.
FoC-Sim outperforms the previous best methods with a 52% higher Recall@1.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Refining Decompiled C Code with Large Language Models [15.76430362775126]
A C decompiler converts an executable into source code.
The recovered C source code, once re-compiled, is expected to produce an executable with the same functionality as the original executable.
arXiv Detail & Related papers (2023-10-10T11:22:30Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries [4.0484792045035505]
We extend large pre-trained language models of source code to summarise decompiled binary functions.
We investigate the impact of input and data properties on the performance of such models.
BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively.
arXiv Detail & Related papers (2023-01-04T16:56:33Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling [17.377157455292817]
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine.
We propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling and Byte-Pair Encoding.
We show that our trained VarBERT model can predict variable names identical to the ones present in the original source code up to 84.15% of the time.
arXiv Detail & Related papers (2021-03-23T19:09:22Z)
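The VarBERT entry above describes constrained masked language modeling for variable names. A minimal illustration of that idea, using a generic `bert-base-uncased` checkpoint from Hugging Face `transformers` as a stand-in for the paper's model and a small hand-picked candidate vocabulary:

```python
# Illustrative only: predict a masked variable name, restricting the masked-LM
# prediction to a candidate set of names (all single tokens in the BERT vocab;
# multi-token names would need subword handling).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Decompiled snippet with the variable to rename replaced by the mask token.
code = "int [MASK] = recv(sock, buf, sizeof(buf), 0); if ([MASK] < 0) return -1;"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Constrain predictions to a candidate vocabulary of plausible variable names.
candidates = ["length", "count", "status", "size", "result", "value"]
candidate_ids = tokenizer.convert_tokens_to_ids(candidates)

mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    scores = logits[0, pos, candidate_ids]
    print("predicted name:", candidates[int(scores.argmax())])
```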
- Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
arXiv Detail & Related papers (2021-01-19T11:45:46Z)
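As a toy sketch of the supervised setup in the entry above (hypothetical features and labels, not the paper's feature set), a classifier can be trained to map per-function features extracted from the binary to a high-level return type:

```python
# Toy sketch of supervised return-type inference; feature columns are
# hypothetical placeholders, e.g. (width in bytes of the register holding the
# return value, result dereferenced as a pointer?, result compared against 0?).
from sklearn.ensemble import RandomForestClassifier

X = [
    [8, 1, 0],
    [4, 0, 1],
    [8, 0, 1],
    [1, 0, 1],
]
y = ["char *", "int", "long", "bool"]  # high-level return types (labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

print(clf.predict([[4, 0, 1]]))  # -> predicted high-level return type, e.g. 'int'
```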
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations, and the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.