FASER: Binary Code Similarity Search through the use of Intermediate
Representations
- URL: http://arxiv.org/abs/2310.03605v3
- Date: Wed, 29 Nov 2023 14:30:29 GMT
- Title: FASER: Binary Code Similarity Search through the use of Intermediate
Representations
- Authors: Josh Collyer, Tim Watson and Iain Phillips
- Abstract summary: Cross-Architecture Binary Code Similarity Search has been explored in numerous studies.
We propose Function as a String Encoded Representation (FASER) to create a model capable of cross architecture function search.
- Score: 0.8594140167290099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Being able to identify functions of interest in cross-architecture software
is useful whether you are analysing for malware, securing the software supply
chain or conducting vulnerability research. Cross-Architecture Binary Code
Similarity Search has been explored in numerous studies and has used a wide
range of different data sources to achieve its goals. The data sources
typically used draw on common structures derived from binaries such as function
control flow graphs or binary level call graphs, the output of the disassembly
process or the outputs of a dynamic analysis approach. One data source which
has received less attention is binary intermediate representations. Binary
Intermediate representations possess two interesting properties: they are cross
architecture by their very nature and encode the semantics of a function
explicitly to support downstream usage. Within this paper we propose Function
as a String Encoded Representation (FASER) which combines long document
transformers with the use of intermediate representations to create a model
capable of cross architecture function search without the need for manual
feature engineering, pre-training or a dynamic analysis step. We compare our
approach against a series of baseline approaches for two tasks; A general
function search task and a targeted vulnerability search task. Our approach
demonstrates strong performance across both tasks, performing better than all
baseline approaches.
Related papers
- Know Your Neighborhood: General and Zero-Shot Capable Binary Function Search Powered by Call Graphlets [0.7646713951724013]
This paper proposes a novel graph neural network architecture combined with a novel graph data representation called call graphlets.
A specialized graph neural network model is then designed to operate on this graph representation, learning to map it to a feature vector that encodes semantic code similarities.
Experimental results demonstrate that the combination of call graphlets and the novel graph neural network architecture achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-06-02T18:26:50Z) - Cross-Inlining Binary Function Similarity Detection [16.923959153965857]
We propose a pattern-based model named CI-Detector for cross-inlining matching.
Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
arXiv Detail & Related papers (2024-01-11T08:42:08Z) - TRIAD: Automated Traceability Recovery based on Biterm-enhanced
Deduction of Transitive Links among Artifacts [53.92293118080274]
Traceability allows stakeholders to extract and comprehend the trace links among software artifacts introduced across the software life cycle.
Most rely on textual similarities among software artifacts, such as those based on Information Retrieval (IR)
arXiv Detail & Related papers (2023-12-28T06:44:24Z) - Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take source image, user guidance and previously predicted mask as the input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z) - UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z) - Correlation-Aware Deep Tracking [83.51092789908677]
We propose a novel target-dependent feature network inspired by the self-/cross-attention scheme.
Our network deeply embeds cross-image feature correlation in multiple layers of the feature network.
Our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods.
arXiv Detail & Related papers (2022-03-03T11:53:54Z) - Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z) - Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z) - Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for
Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z) - Comparative Code Structure Analysis using Deep Learning for Performance
Prediction [18.226950022938954]
This paper aims to assess the feasibility of using purely static information (e.g., abstract syntax tree or AST) of applications to predict performance change based on the change in code structure.
Our evaluations of several deep embedding learning methods demonstrate that tree-based Long Short-Term Memory (LSTM) models can leverage the hierarchical structure of source-code to discover latent representations and achieve up to 84% (individual problem) and 73% (combined dataset with multiple of problems) accuracy in predicting the change in performance.
arXiv Detail & Related papers (2021-02-12T16:59:12Z) - Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning.
We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations.
We leverage the powerful CNN for images and propose a CNN-based deep architecture to learn text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.