UniASM: Binary Code Similarity Detection without Fine-tuning
- URL: http://arxiv.org/abs/2211.01144v3
- Date: Thu, 6 Apr 2023 04:49:49 GMT
- Title: UniASM: Binary Code Similarity Detection without Fine-tuning
- Authors: Yeming Gu, Hui Shu and Fan Hu
- Abstract summary: We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binary code similarity detection (BCSD) is widely used in various binary
analysis tasks such as vulnerability search, malware detection, clone
detection, and patch analysis. Recent studies have shown that the
learning-based binary code embedding models perform better than the traditional
feature-based approaches. In this paper, we propose a novel transformer-based
binary code embedding model named UniASM to learn representations of the binary
functions. We design two new training tasks to make the spatial distribution of
the generated vectors more uniform, which can be used directly in BCSD without
any fine-tuning. In addition, we present a new tokenization approach for binary
functions, which increases the tokens' semantic information and mitigates the
out-of-vocabulary (OOV) problem. We conduct an in-depth analysis of the factors
affecting model performance through ablation experiments and obtain some new
and valuable findings. The experimental results show that UniASM outperforms
the state-of-the-art (SOTA) approach on the evaluation dataset. The average
Recall@1 scores on cross-compiler, cross-optimization-level, and
cross-obfuscation settings are 0.77, 0.72, and 0.72, respectively. Besides, in the real-world task of
known vulnerability search, UniASM outperforms all the current baselines.
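Recall@1, the metric reported above, measures how often the true match is the single nearest neighbor of a query embedding. The sketch below is illustrative only: it uses random toy vectors, not UniASM's actual embeddings, and assumes cosine similarity as the distance measure.

```python
import numpy as np

def recall_at_1(query_embs, pool_embs, ground_truth):
    """Recall@1: fraction of queries whose nearest neighbor in the
    candidate pool (by cosine similarity) is the ground-truth match.
    query_embs: (n, d) array; pool_embs: (m, d) array;
    ground_truth[i]: index in the pool of query i's true match."""
    # Normalize rows so that a dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                      # (n, m) similarity matrix
    top1 = sims.argmax(axis=1)          # nearest pool index per query
    return float(np.mean(top1 == np.asarray(ground_truth)))

# Toy example: 3 queries that are noisy copies of pool functions.
rng = np.random.default_rng(0)
pool = rng.normal(size=(4, 8))
queries = pool[[2, 0, 1]] + 0.01 * rng.normal(size=(3, 8))
print(recall_at_1(queries, pool, ground_truth=[2, 0, 1]))  # 1.0
```

In a real BCSD pipeline the pool would hold embeddings of every function in the target binaries, and a cross-compiler or cross-optimization query would be matched against it the same way.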
Related papers
- A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer [15.689556592544667]
We introduce ProTST, a novel transformer-based methodology for binary code embedding.
ProTST employs a hierarchical training process based on a unique tree-like structure.
Results show that ProTST yields an average validation score (F1, MRR, and Recall@1) improvement of 14.8% compared to traditional two-stage training.
arXiv Detail & Related papers (2024-12-15T13:04:29Z)
- USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation [24.90512145836643]
We introduce a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation.
We show that our approach significantly outperforms the current state-of-the-art (SOTA) approaches.
arXiv Detail & Related papers (2024-12-12T12:20:27Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- Revisiting BPR: A Replicability Study of a Common Recommender System Baseline [78.00363373925758]
We study the features of the BPR model, indicating their impact on its performance, and investigate open-source BPR implementations.
Our analysis reveals inconsistencies between these implementations and the original BPR paper, leading to a significant decrease in performance of up to 50% for specific implementations.
We show that the BPR model can achieve performance levels close to state-of-the-art methods on the top-n recommendation tasks and even outperform them on specific datasets.
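BPR here is Bayesian Personalized Ranking, whose pairwise objective for a user u, observed item i, and sampled negative item j is -log σ(x̂_ui − x̂_uj). A minimal NumPy sketch of that loss follows; the embeddings are made-up illustrations, not taken from any implementation examined in the study.

```python
import numpy as np

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """Bayesian Personalized Ranking loss for one (user, positive, negative)
    triple: -log sigmoid(x_ui - x_uj). Minimizing it pushes the user's
    predicted score for the observed item above the sampled negative's."""
    x_ui = user_emb @ pos_item_emb   # predicted preference for positive item
    x_uj = user_emb @ neg_item_emb   # predicted preference for negative item
    # -log sigmoid(m) == log(1 + exp(-m)), computed stably via logaddexp.
    return np.logaddexp(0.0, -(x_ui - x_uj))

u = np.array([0.5, -0.2, 0.1])
i_pos = np.array([0.4, -0.1, 0.3])
j_neg = np.array([-0.3, 0.2, 0.0])
print(bpr_loss(u, i_pos, j_neg))  # ≈ 0.497
```

Differences in negative sampling and regularization around this loss are exactly the kind of implementation detail the replicability study flags as a source of the performance gaps.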
arXiv Detail & Related papers (2024-09-21T18:39:53Z)
- Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases [9.422025563792818]
Human-Oriented Binary Reverse Engineering aims to lift binary code to human-readable content relevant to source code.
We introduce a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis.
arXiv Detail & Related papers (2024-05-30T00:17:44Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Learning Similarity Preserving Binary Codes for Recommender Systems [5.799838997511804]
We study an unexplored module combination for hashing-based recommender systems, namely the Compact Cross-Similarity Recommender (CCSR).
Inspired by cross-modal retrieval, CCSR uses a posteriori similarity instead of matrix factorization and rating reconstruction to model interactions between users and items.
On the MovieLens1M dataset, the absolute performance improvements are up to 15.69% in NDCG and 4.29% in Recall.
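NDCG, one of the two metrics quoted above, discounts graded relevance by rank position and normalizes against the ideal ordering. A small illustrative sketch follows; the relevance lists are invented, not drawn from the MovieLens1M experiments.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k for one ranked list. `relevances` holds the graded relevance
    of each recommended item in rank order; the ideal ordering sorts the
    same grades in descending order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1/log2(rank+1)
    dcg = float(rel @ discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(ideal @ (1.0 / np.log2(np.arange(2, ideal.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# Relevant item at rank 3 instead of rank 2 costs ~8% of the ideal score.
print(ndcg_at_k([1, 0, 1], k=3))  # ≈ 0.9197
```

Because the discount is logarithmic in rank, NDCG rewards placing relevant items near the top more strongly than a plain recall count does, which is why papers often report both.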
arXiv Detail & Related papers (2022-04-18T21:33:59Z)
- BCFNet: A Balanced Collaborative Filtering Network with Attention Mechanism [106.43103176833371]
Collaborative Filtering (CF) based recommendation methods have been widely studied.
We propose a novel recommendation model named Balanced Collaborative Filtering Network (BCFNet).
In addition, an attention mechanism is designed to better capture the hidden information within implicit feedback and strengthen the learning ability of the neural network.
arXiv Detail & Related papers (2021-03-10T14:59:23Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Symbiotic Adversarial Learning for Attribute-based Person Search [86.7506832053208]
We present a symbiotic adversarial learning framework called SAL. Two GANs sit at the base of the framework in a symbiotic learning scheme.
Specifically, two different types of generative adversarial networks learn collaboratively throughout the training process.
arXiv Detail & Related papers (2020-07-19T07:24:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.