FASER: Binary Code Similarity Search through the use of Intermediate
Representations
- URL: http://arxiv.org/abs/2310.03605v3
- Date: Wed, 29 Nov 2023 14:30:29 GMT
- Title: FASER: Binary Code Similarity Search through the use of Intermediate
Representations
- Authors: Josh Collyer, Tim Watson and Iain Phillips
- Abstract summary: Cross-Architecture Binary Code Similarity Search has been explored in numerous studies.
We propose Function as a String Encoded Representation (FASER) to create a model capable of cross-architecture function search.
- Score: 0.8594140167290099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Being able to identify functions of interest in cross-architecture software
is useful whether you are analysing malware, securing the software supply
chain or conducting vulnerability research. Cross-Architecture Binary Code
Similarity Search has been explored in numerous studies and has used a wide
range of different data sources to achieve its goals. The data sources
typically used draw on common structures derived from binaries such as function
control flow graphs or binary-level call graphs, the output of the disassembly
process or the outputs of a dynamic analysis approach. One data source which
has received less attention is binary intermediate representations. Binary
intermediate representations possess two interesting properties: they are
cross-architecture by their very nature and they encode the semantics of a
function explicitly to support downstream usage. In this paper we propose Function
as a String Encoded Representation (FASER) which combines long document
transformers with the use of intermediate representations to create a model
capable of cross-architecture function search without the need for manual
feature engineering, pre-training or a dynamic analysis step. We compare our
approach against a series of baseline approaches on two tasks: a general
function search task and a targeted vulnerability search task. Our approach
demonstrates strong performance across both tasks, performing better than all
baseline approaches.
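To make the recipe in the abstract concrete, here is a minimal sketch, not the
authors' released FASER model: lift a function to an intermediate-representation
string, encode it with a long-document transformer, and rank candidate functions
by embedding similarity. The Longformer checkpoint, mean pooling and toy IR
strings below are illustrative assumptions; a general-purpose pretrained encoder
is only a stand-in for a model trained on paired cross-architecture functions.

```python
# Minimal sketch (assumptions noted above): embed IR strings with a
# long-document transformer and compare functions by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/longformer-base-4096"  # placeholder long-document encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def embed_ir(ir_text: str) -> torch.Tensor:
    """Mean-pool token embeddings of one function's IR string into a vector."""
    enc = tokenizer(ir_text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (1, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical IR strings lifted from two builds of the same function.
query = embed_ir("define i32 @check(i32 %a) { %1 = icmp sgt i32 %a, 0 ... }")
candidates = [
    embed_ir("define i32 @check(i32 %x) { %1 = icmp sgt i32 %x, 0 ... }"),
    embed_ir("define void @log_init() { call void @open_log() ... }"),
]
scores = [torch.nn.functional.cosine_similarity(query, c).item() for c in candidates]
print(scores)  # higher score = more likely the same function
```

In a trained system the encoder would be optimised with a metric-learning
objective over IR strings rather than used off the shelf as it is here.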
Related papers
- Is Function Similarity Over-Engineered? Building a Benchmark [37.33020176141435]
We build a new benchmark for binary function similarity detection consisting of high-quality datasets and tests that better reflect real-world use cases.
Our benchmark reveals that a new, simple baseline, which looks only at the raw bytes of a function and requires no disassembly or other pre-processing, achieves state-of-the-art performance in multiple settings.
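As a rough illustration of what such a byte-level baseline could look like (a minimal sketch under assumed inputs, not the benchmark's actual baseline), the snippet below scores two functions by comparing overlapping 4-grams of their raw machine-code bytes, with no disassembly; the byte sequences and the cosine scoring are hypothetical.

```python
# Minimal sketch of a raw-byte similarity baseline: no disassembly, no IR,
# just overlapping byte n-grams compared with cosine similarity.
from collections import Counter
from math import sqrt

def byte_ngrams(code: bytes, n: int = 4) -> Counter:
    """Count overlapping n-grams of raw machine-code bytes."""
    return Counter(code[i:i + n] for i in range(max(len(code) - n + 1, 0)))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical byte sequences of the same small function from two builds.
f_build1 = bytes.fromhex("554889e58b45fc83c0015dc3")
f_build2 = bytes.fromhex("554889e58b45f883c0025dc3")
print(cosine(byte_ngrams(f_build1), byte_ngrams(f_build2)))  # value in [0, 1]
```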
arXiv Detail & Related papers (2024-10-30T03:59:46Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric encapsulates two formats of terms: diagonal and block-diagonal.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z)
- Know Your Neighborhood: General and Zero-Shot Capable Binary Function Search Powered by Call Graphlets [0.7646713951724013]
This paper proposes a novel graph neural network architecture combined with a novel graph data representation called call graphlets.
A specialized graph neural network model is then designed to operate on this graph representation, learning to map it to a feature vector that encodes semantic code similarities.
Experimental results demonstrate that the combination of call graphlets and the novel graph neural network architecture achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-06-02T18:26:50Z)
- TRIAD: Automated Traceability Recovery based on Biterm-enhanced Deduction of Transitive Links among Artifacts [53.92293118080274]
Traceability allows stakeholders to extract and comprehend the trace links among software artifacts introduced across the software life cycle.
Most existing approaches rely on textual similarities among software artifacts, such as those based on Information Retrieval (IR).
arXiv Detail & Related papers (2023-12-28T06:44:24Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance and the previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z)
- Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning.
We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations.
We leverage a powerful CNN for images and propose a CNN-based deep architecture to learn the text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.