Revisiting Lightweight Compiler Provenance Recovery on ARM Binaries
- URL: http://arxiv.org/abs/2305.03934v1
- Date: Sat, 6 May 2023 05:20:39 GMT
- Title: Revisiting Lightweight Compiler Provenance Recovery on ARM Binaries
- Authors: Jason Kim, Daniel Genkin, Kevin Leach
- Abstract summary: We extend previous work with a shallow-learning model that efficiently and accurately recovers compiler configuration properties for ARM binaries.
We achieve over 99% accuracy, on par with state-of-the-art deep learning approaches, while achieving a 583-times speedup during training and 3,826-times speedup during inference.
- Score: 10.38910167947036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A binary's behavior is greatly influenced by how the compiler builds its
source code. Although most compiler configuration details are abstracted away
during compilation, recovering them is useful for reverse engineering and
program comprehension tasks on unknown binaries, such as code similarity
detection. We observe that previous work has thoroughly explored this on x86-64
binaries. However, there has been limited investigation of ARM binaries, which
are increasingly prevalent.
In this paper, we extend previous work with a shallow-learning model that
efficiently and accurately recovers compiler configuration properties for ARM
binaries. We apply opcode and register-derived features, which have previously
been effective on x86-64 binaries, to ARM binaries. Furthermore, we compare
this work with Pizzolotto et al., a recent architecture-agnostic model that
uses deep learning, whose dataset and code are available.
We observe that the lightweight features are reproducible on ARM binaries. We
achieve over 99% accuracy, on par with state-of-the-art deep learning
approaches, while achieving a 583-times speedup during training and 3,826-times
speedup during inference. Finally, we also discuss findings of overfitting that
went undetected in prior work.
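The idea of pairing opcode and register-derived features with a shallow model can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature definition (per-instruction frequencies), the nearest-centroid classifier, the helper names (`extract_features`, `train`, `predict`), and the hand-written ARM listings with their "O0"/"O2" labels are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Assumption: a small set of ARM register names is enough for this toy example.
REGISTER_RE = re.compile(r"\b(r\d+|sp|lr|pc|fp|ip)\b")


def extract_features(disassembly: str) -> Dict[str, float]:
    """Count opcodes and register mentions, divided by the instruction count."""
    counts: Counter = Counter()
    instructions = 0
    for line in disassembly.strip().splitlines():
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        instructions += 1
        counts["op:" + parts[0].lower()] += 1
        if len(parts) > 1:
            for reg in REGISTER_RE.findall(parts[1].lower()):
                counts["reg:" + reg] += 1
    if instructions == 0:
        return {}
    return {name: n / instructions for name, n in counts.items()}


def centroid(vectors: List[Dict[str, float]]) -> Dict[str, float]:
    """Average sparse feature vectors, treating missing keys as zero."""
    keys = set().union(*vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}


def distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance between two sparse feature vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def train(samples: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Build one centroid per label (a stand-in for a real shallow model)."""
    by_label: Dict[str, List[Dict[str, float]]] = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(extract_features(text))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model: Dict[str, Dict[str, float]], disassembly: str) -> str:
    """Return the label whose centroid is closest to the sample's features."""
    features = extract_features(disassembly)
    return min(model, key=lambda label: distance(model[label], features))


# Hand-written listings for illustration only; not real compiler output.
O0_A = """push {fp, lr}
add fp, sp, #4
sub sp, sp, #8
str r0, [fp, #-8]
ldr r3, [fp, #-8]
add r3, r3, #1
mov r0, r3
sub sp, fp, #4
pop {fp, pc}"""
O0_B = """push {fp, lr}
add fp, sp, #4
sub sp, sp, #16
str r0, [fp, #-16]
str r1, [fp, #-20]
ldr r2, [fp, #-16]
ldr r3, [fp, #-20]
add r3, r2, r3
mov r0, r3
sub sp, fp, #4
pop {fp, pc}"""
O2_A = """add r0, r0, #1
bx lr"""
O2_B = """lsl r0, r0, #1
add r0, r0, r1
bx lr"""

model = train([(O0_A, "O0"), (O0_B, "O0"), (O2_A, "O2"), (O2_B, "O2")])

UNSEEN_O0_STYLE = """push {fp, lr}
add fp, sp, #4
str r0, [fp, #-8]
ldr r0, [fp, #-8]
pop {fp, pc}"""
UNSEEN_O2_STYLE = """mul r0, r0, r1
bx lr"""

print(predict(model, UNSEEN_O0_STYLE))
print(predict(model, UNSEEN_O2_STYLE))
```

In this toy setup, the frame-pointer-heavy listing lands near the "O0" centroid and the short leaf function near the "O2" centroid. A realistic pipeline would disassemble actual binaries and train on far more samples; the point here is only that such features are cheap to compute compared with learning from raw bytes.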
Related papers
- Breaking Bad: How Compilers Break Constant-Time Implementations [12.486727810118497]
We investigate how compilers break protections introduced by defensive programming techniques.
We run a large-scale experiment to see if such compiler-induced issues manifest in state-of-the-art cryptographic libraries.
Our study reveals that several compiler-induced secret-dependent operations occur within some of the most highly regarded cryptographic libraries.
arXiv Detail & Related papers (2024-10-17T12:34:02Z)
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS.
CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
arXiv Detail & Related papers (2023-10-24T14:20:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- HDCC: A Hyperdimensional Computing compiler for classification on embedded systems and high-performance computing [58.720142291102135]
This work introduces the HDCC compiler, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
arXiv Detail & Related papers (2023-04-24T19:16:03Z)
- Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries [4.0484792045035505]
We extend large pre-trained language models of source code to summarise decompiled binary functions.
We investigate the impact of input and data properties on the performance of such models.
BinT5 achieves the state-of-the-art BLEU-4 score of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code.
arXiv Detail & Related papers (2023-01-04T16:56:33Z)
- ARMS: Antithetic-REINFORCE-Multi-Sample Gradient for Binary Variables [60.799183326613395]
ARMS is an Antithetic REINFORCE-based Multi-Sample gradient estimator.
ARMS uses a copula to generate any number of mutually antithetic samples.
We evaluate ARMS on several datasets for training generative models, and our experimental results show that it outperforms competing methods.
arXiv Detail & Related papers (2021-05-28T23:19:54Z)
- UNIT: Unifying Tensorized Instruction Compilation [11.193044425743981]
Hardware vendors offer tensorized instructions for mixed-precision operations, like Intel VNNI, Tensor Core, and ARM-DOT.
The lack of compilation techniques for this makes it hard to utilize these instructions.
We develop a compiler framework to unify the compilation for these instructions.
arXiv Detail & Related papers (2021-01-21T06:22:58Z)
- Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
arXiv Detail & Related papers (2021-01-19T11:45:46Z)
- XDA: Accurate, Robust Disassembly with Transfer Learning [23.716121748941138]
XDA is a transfer-learning-based disassembly framework.
It learns different contextual dependencies present in machine code.
It is up to 38x faster than hand-written disassemblers like IDA Pro.
arXiv Detail & Related papers (2020-10-02T04:14:17Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)