Related papers: A Natural Language Processing Approach for Instruction Set Architecture Identification

A Natural Language Processing Approach for Instruction Set Architecture Identification

URL: http://arxiv.org/abs/2204.06624v1
Date: Wed, 13 Apr 2022 19:45:06 GMT
Title: A Natural Language Processing Approach for Instruction Set Architecture Identification
Authors: Dinuka Sahabandu, Sukarno Mertoguno, Radha Poovendran
Abstract summary: We introduce character-level features of encoded binaries to identify fine-grained bit patterns inherent to each ISA. Our approach results in an 8% higher accuracy than the state-of-the-art features based on byte-histograms and byte pattern signatures.
Score: 6.495883501989546
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Binary analysis of software is a critical step in cyber forensics applications such as program vulnerability assessment and malware detection. This involves interpreting instructions executed by software and often necessitates converting the software's binary file data to assembly language. The conversion process requires information about the binary file's target instruction set architecture (ISA). However, ISA information might not be included in binary files due to compilation errors, partial downloads, or adversarial corruption of file metadata. Machine learning (ML) is a promising methodology that can be used to identify the target ISA using binary data in the object code section of binary files. In this paper we propose a binary code feature extraction model to improve the accuracy and scalability of ML-based ISA identification methods. Our feature extraction model can be used in the absence of domain knowledge about the ISAs. Specifically, we adapt models from natural language processing (NLP) to i) identify successive byte patterns commonly observed in binary codes, ii) estimate the significance of each byte pattern to a binary file, and iii) estimate the relevance of each byte pattern in distinguishing between ISAs. We introduce character-level features of encoded binaries to identify fine-grained bit patterns inherent to each ISA. We use a dataset with binaries from 12 different ISAs to evaluate our approach. Empirical evaluations show that using our byte-level features in ML-based ISA identification results in an 8% higher accuracy than the state-of-the-art features based on byte-histograms and byte pattern signatures. We observe that character-level features allow reducing the size of the feature set by up to 16x while maintaining accuracy above 97%.

Related papers

Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code [55.493408628371235]
We propose ByteTR, a framework for recovering variable types in binary code. In light of the ubiquity of variable propagation across functions, ByteTR conducts inter-procedural analysis to trace variable propagation and employs a gated graph neural network to capture long-range data flow dependencies for variable type recovery.
arXiv Detail & Related papers (2025-03-10T12:27:05Z)
StrTune: Data Dependence-based Code Slicing for Binary Similarity Detection with Fine-tuned Representation [5.41477941455399]
BCSD can address binary tasks such as malicious code snippets identification and binary patch analysis by comparing code patterns. Because binaries are compiled with different compilation configurations, existing approaches still face notable limitations when comparing binary similarity. We propose StrTune, which slices binary code based on data dependence and perform slice-level fine-tuning.
arXiv Detail & Related papers (2024-11-19T12:20:08Z)
Discovery of Endianness and Instruction Size Characteristics in Binary Programs from Unknown Instruction Set Architectures [0.0]
We study the problem of streamlining reverse engineering of binary programs from unknown instruction set architectures (ISA) We focus on two fundamental ISA characteristics to beginning the RE process: identification of endianness and whether the instruction width is a fixed or variable. We use bigram-based features for endianness detection and the autocorrelation function, commonly used in signal processing applications, for differentiation between fixed- and variable-width instruction sizes.
arXiv Detail & Related papers (2024-10-28T21:43:53Z)
Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery [2.022692275087205]
Cross-architecture binary code analysis has become an emerging problem. Deep learning-based binary analysis has shown promising success. For some low-resource ISAs, an adequate amount of data is hard to find.
arXiv Detail & Related papers (2024-04-29T18:09:28Z)
How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
Accurate and Extensible Symbolic Execution of Binary Code based on Formal ISA Semantics [2.3589687550723335]
We present a prototype for the RISC-V ISA and conduct a case study to demonstrate that it can be easily extended to additional instructions. We perform an experimental comparison with prior work which resulted in the discovery of five previously unknown bugs in the ISA implementation of the popular IR-based symbolic executor angr.
arXiv Detail & Related papers (2024-04-05T14:29:39Z)
FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
Beyond Language Models: Byte Models are Digital World Simulators [68.91268999567473]
bGPT is a model with next byte prediction to simulate the digital world. It matches specialized models in performance across various modalities, including text, audio, and images. It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte.
arXiv Detail & Related papers (2024-02-29T13:38:07Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model [25.014876893315208]
We propose a new method to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings.
arXiv Detail & Related papers (2023-08-29T17:20:35Z)
Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model [57.92200214957124]
External language models (LMs) are used to improve the recognition performance of end-to-end (E2E) automatic speech recognition (ASR) systems. We propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences. Our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM.
arXiv Detail & Related papers (2022-01-06T10:04:56Z)
Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary. In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.