A Natural Language Processing Approach for Instruction Set Architecture
Identification
- URL: http://arxiv.org/abs/2204.06624v1
- Date: Wed, 13 Apr 2022 19:45:06 GMT
- Title: A Natural Language Processing Approach for Instruction Set Architecture
Identification
- Authors: Dinuka Sahabandu, Sukarno Mertoguno, Radha Poovendran
- Abstract summary: We introduce character-level features of encoded binaries to identify fine-grained bit patterns inherent to each ISA.
Our approach results in an 8% higher accuracy than the state-of-the-art features based on byte-histograms and byte pattern signatures.
- Score: 6.495883501989546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Binary analysis of software is a critical step in cyber forensics
applications such as program vulnerability assessment and malware detection.
This involves interpreting instructions executed by software and often
necessitates converting the software's binary file data to assembly language.
The conversion process requires information about the binary file's target
instruction set architecture (ISA). However, ISA information might not be
included in binary files due to compilation errors, partial downloads, or
adversarial corruption of file metadata. Machine learning (ML) is a promising
methodology that can be used to identify the target ISA using binary data in
the object code section of binary files. In this paper we propose a binary code
feature extraction model to improve the accuracy and scalability of ML-based
ISA identification methods. Our feature extraction model can be used in the
absence of domain knowledge about the ISAs. Specifically, we adapt models from
natural language processing (NLP) to i) identify successive byte patterns
commonly observed in binary codes, ii) estimate the significance of each byte
pattern to a binary file, and iii) estimate the relevance of each byte pattern
in distinguishing between ISAs. We introduce character-level features of
encoded binaries to identify fine-grained bit patterns inherent to each ISA. We
use a dataset with binaries from 12 different ISAs to evaluate our approach.
Empirical evaluations show that using our byte-level features in ML-based ISA
identification results in an 8% higher accuracy than the state-of-the-art
features based on byte-histograms and byte pattern signatures. We observe that
character-level features allow reducing the size of the feature set by up to
16x while maintaining accuracy above 97%.
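The three NLP-style steps named in the abstract (finding successive patterns, weighting each pattern within a file, and weighting it across files) resemble n-gram TF-IDF. The following is a minimal Python sketch of that general idea, applied to character n-grams of hex-encoded bytes. It is not the authors' implementation: the function names `char_ngrams` and `tfidf`, the hex encoding, the n-gram length, and the smoothed IDF formula are all assumptions chosen for illustration, and the sample byte strings are arbitrary toy inputs.

```python
import math
from collections import Counter


def char_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the hex-encoded bytes."""
    # Assumption: hex encoding, so each byte becomes two characters
    # (one per 4-bit nibble) and n-grams can straddle byte boundaries.
    text = data.hex()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf(samples: list[bytes], n: int = 3) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (n-gram -> weight) per sample."""
    counts = [char_ngrams(s, n) for s in samples]
    num_docs = len(counts)
    # Document frequency: iterating a Counter yields each key once,
    # so every sample contributes at most 1 per n-gram.
    doc_freq = Counter(g for c in counts for g in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            # Term frequency: how significant the pattern is to this sample.
            # Smoothed IDF (an assumed variant): how well the pattern
            # separates samples; equals 1.0 when it occurs in every sample.
            g: (k / total) * (math.log((1 + num_docs) / (1 + doc_freq[g])) + 1.0)
            for g, k in c.items()
        })
    return vectors


# Arbitrary toy inputs, not taken from the paper's dataset.
toy_samples = [
    bytes.fromhex("0000a0e10000a0e1"),
    bytes.fromhex("9090c39090c3"),
]
for vector in tfidf(toy_samples, n=3):
    top = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(top)
```

In a pipeline of this kind, the resulting sparse vectors would typically be passed to a standard classifier. Working on characters rather than whole bytes keeps the vocabulary small (at most 16^n distinct n-grams under the hex assumption), which is one plausible reading of the feature-set reduction the abstract reports.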
Related papers
- StrTune: Data Dependence-based Code Slicing for Binary Similarity Detection with Fine-tuned Representation [5.41477941455399]
Binary code similarity detection (BCSD) can address binary tasks such as malicious code snippet identification and binary patch analysis by comparing code patterns.
Because binaries are compiled with different compilation configurations, existing approaches still face notable limitations when comparing binary similarity.
We propose StrTune, which slices binary code based on data dependence and performs slice-level fine-tuning.
arXiv Detail & Related papers (2024-11-19T12:20:08Z)
- Discovery of Endianness and Instruction Size Characteristics in Binary Programs from Unknown Instruction Set Architectures [0.0]
We study the problem of streamlining reverse engineering (RE) of binary programs from unknown instruction set architectures (ISAs).
We focus on two fundamental ISA characteristics needed to begin the RE process: the endianness, and whether the instruction width is fixed or variable.
We use bigram-based features for endianness detection and the autocorrelation function, commonly used in signal processing applications, for differentiation between fixed- and variable-width instruction sizes.
arXiv Detail & Related papers (2024-10-28T21:43:53Z)
- Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery [2.022692275087205]
Cross-architecture binary code analysis has become an emerging problem.
Deep learning-based binary analysis has shown promising success.
For some low-resource ISAs, an adequate amount of data is hard to find.
arXiv Detail & Related papers (2024-04-29T18:09:28Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.
We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- Beyond Language Models: Byte Models are Digital World Simulators [68.91268999567473]
bGPT is a model with next byte prediction to simulate the digital world.
It matches specialized models in performance across various modalities, including text, audio, and images.
It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte.
arXiv Detail & Related papers (2024-02-29T13:38:07Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model [25.014876893315208]
We propose a new method to represent binary program semantics.
It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries.
Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings.
arXiv Detail & Related papers (2023-08-29T17:20:35Z)
- Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model [57.92200214957124]
External language models (LMs) are used to improve the recognition performance of end-to-end (E2E) automatic speech recognition (ASR) systems.
We propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences.
Our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM.
arXiv Detail & Related papers (2022-01-06T10:04:56Z)
- Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovering the contextual meaning of binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic, which uses BERT to produce a semantic-aware code representation of binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)