XDA: Accurate, Robust Disassembly with Transfer Learning
- URL: http://arxiv.org/abs/2010.00770v3
- Date: Thu, 19 Nov 2020 04:40:24 GMT
- Title: XDA: Accurate, Robust Disassembly with Transfer Learning
- Authors: Kexin Pei, Jonas Guan, David Williams-King, Junfeng Yang, Suman Jana
- Abstract summary: XDA is a transfer-learning-based disassembly framework.
It learns different contextual dependencies present in machine code.
It is up to 38x faster than hand-written disassemblers like IDA Pro.
- Score: 23.716121748941138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and robust disassembly of stripped binaries is challenging. The root
of the difficulty is that high-level structures, such as instruction and
function boundaries, are absent in stripped binaries and must be recovered
based on incomplete information. Current disassembly approaches rely on
heuristics or simple pattern matching to approximate the recovery, but these
methods are often inaccurate and brittle, especially across different compiler
optimizations.
We present XDA, a transfer-learning-based disassembly framework that learns
different contextual dependencies present in machine code and transfers this
knowledge for accurate and robust disassembly. We design a self-supervised
learning task motivated by masked language modeling to learn interactions among
byte sequences in binaries. The outputs from this task are byte embeddings that
encode sophisticated contextual dependencies between input binaries' byte
tokens, which can then be finetuned for downstream disassembly tasks.
We evaluate XDA's performance on two disassembly tasks, recovering function
boundaries and assembly instructions, on a collection of 3,121 binaries taken
from SPEC CPU2017, SPEC CPU2006, and the BAP corpus. The binaries are compiled
by GCC, ICC, and MSVC on x86/x64 Windows and Linux platforms over 4
optimization levels. XDA achieves 99.0% and 99.7% F1 score at recovering
function boundaries and instructions, respectively, surpassing the previous
state-of-the-art on both tasks. It also maintains speed on par with the fastest
ML-based approach and is up to 38x faster than hand-written disassemblers like
IDA Pro. We release the code of XDA at https://github.com/CUMLSec/XDA.
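The self-supervised task described in the abstract (masking bytes and predicting them from their context) can be illustrated with a minimal sketch. This is not the XDA implementation: the mask token id `MASK`, the helper `mask_bytes`, the masking rate, and the example byte string are all assumptions chosen for illustration, and the model that would predict the masked bytes is omitted.

```python
import random

# Hypothetical id for the mask token; real byte values occupy 0..255.
MASK = 256


def mask_bytes(code: bytes, mask_rate: float = 0.2, seed: int = 0):
    """Build (inputs, labels) for a masked-byte prediction task.

    inputs: token ids, with a random subset of positions replaced by MASK.
    labels: the original byte at each masked position, None elsewhere.
    The mask_rate default is an illustrative assumption, not a value
    taken from the paper.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for b in code:
        if rng.random() < mask_rate:
            inputs.append(MASK)   # hide this byte from the model
            labels.append(b)      # the model would be trained to recover it
        else:
            inputs.append(b)
            labels.append(None)   # no loss at unmasked positions
    return inputs, labels


# Illustrative x86-64 bytes: push rbp; mov rbp, rsp; pop rbp; ret
snippet = bytes.fromhex("554889e55dc3")
inputs, labels = mask_bytes(snippet, mask_rate=0.5, seed=1)
```

A model trained to fill in the masked positions has to use the surrounding bytes, which is how contextual dependencies such as instruction encodings could be picked up before finetuning on labeled function or instruction boundaries.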
Related papers
- Disassembling Obfuscated Executables with LLM [9.897996716496982]
We present DisasLLM, a novel disassembler that addresses the challenge of analyzing obfuscated executables.
DisasLLM consists of two components: an LLM-based classifier that determines whether an instruction in an assembly code snippet is correctly decoded, and a disassembly strategy that leverages this model to disassemble obfuscated executables end-to-end.
We evaluated DisasLLM on a set of heavily obfuscated executables and show that it significantly outperforms other state-of-the-art disassembly solutions.
arXiv Detail & Related papers (2024-07-12T02:10:07Z)
- kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS.
CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
arXiv Detail & Related papers (2023-10-24T14:20:39Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z)
- Revisiting Lightweight Compiler Provenance Recovery on ARM Binaries [10.38910167947036]
We extend previous work with a shallow-learning model that efficiently and accurately recovers compiler configuration properties for ARM binaries.
We achieve over 99% accuracy, on par with state-of-the-art deep learning approaches, while achieving a 583-times speedup during training and 3,826-times speedup during inference.
arXiv Detail & Related papers (2023-05-06T05:20:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- HDCC: A Hyperdimensional Computing compiler for classification on embedded systems and high-performance computing [58.720142291102135]
This work introduces the HDCC compiler, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
arXiv Detail & Related papers (2023-04-24T19:16:03Z)
- NeuDep: Neural Binary Memory Dependence Analysis [28.33030658966508]
We present a new machine-learning-based approach to predict memory dependencies by exploiting the model's learned knowledge about how binary programs execute.
We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled by 2 compilers, 4 optimizations, and 4 obfuscation passes.
arXiv Detail & Related papers (2022-10-04T04:59:36Z)
- SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings [2.1222884030559315]
We propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings.
We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination.
SimCLF directly operates on disassembled binary functions and could be implemented with any encoder.
arXiv Detail & Related papers (2022-09-06T12:09:45Z)
- UNIT: Unifying Tensorized Instruction Compilation [11.193044425743981]
Hardware vendors offer tensorized instructions for mixed-precision operations, like Intel VNNI, Tensor Core, and ARM-DOT.
The lack of compilation techniques for this makes it hard to utilize these instructions.
We develop a compiler framework to unify the compilation for these instructions.
arXiv Detail & Related papers (2021-01-21T06:22:58Z)