CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity
Detection
- URL: http://arxiv.org/abs/2402.18818v1
- Date: Thu, 29 Feb 2024 03:02:07 GMT
- Title: CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity
Detection
- Authors: Hao Wang, Zeyu Gao, Chao Zhang, Mingyang Sun, Yuchen Zhou, Han Qiu, Xi
Xiao
- Abstract summary: Binary code similarity detection (BCSD) is a fundamental technique for various applications.
We propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches.
- Score: 23.8834126695488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binary code similarity detection (BCSD) is a fundamental technique for
various applications. Many BCSD solutions have been proposed recently; most
are embedding-based, but they have shown limited accuracy and efficiency,
especially when the volume of target binaries to search is large. To address
this issue, we propose a cost-effective BCSD framework, CEBin, which fuses
embedding-based and comparison-based approaches to significantly improve
accuracy while minimizing overheads. Specifically, CEBin utilizes a refined
embedding-based approach to extract features of target code, which efficiently
narrows down the scope of candidate similar code and boosts performance. Then,
it utilizes a comparison-based approach that performs a pairwise comparison on
the candidates to capture more nuanced and complex relationships, which greatly
improves the accuracy of similarity detection. By bridging the gap between
embedding-based and comparison-based approaches, CEBin is able to provide an
effective and efficient solution for detecting similar code (including
vulnerable ones) in large-scale software ecosystems. Experimental results on
three well-known datasets demonstrate the superiority of CEBin over existing
state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD
in the real world, we construct a large-scale vulnerability benchmark, offering
the first precise evaluation scheme to assess BCSD methods on the 1-day
vulnerability detection task. CEBin can identify the similar function among
millions of candidate functions in just a few seconds and achieves a
recall of $85.46\%$ on this more practical but challenging task, which is
several orders of magnitude faster and $4.07\times$ better than
the best SOTA baseline. Our code is available at
https://github.com/Hustcw/CEBin.
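The two-stage idea described in the abstract (a cheap embedding search to shortlist candidates, followed by a more expensive pairwise comparison on the shortlist only) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_embed`, `toy_compare`, `VOCAB`, and the toy functions are hypothetical stand-ins, not CEBin's models, data, or API.

```python
import math

# Hypothetical fixed opcode vocabulary for the toy embedding (an assumption).
VOCAB = ["mov", "add", "sub", "cmp", "jmp", "call", "ret"]


def toy_embed(fn):
    """Stand-in encoder: bag-of-opcodes counts. Not CEBin's embedding model."""
    return [fn.count(op) for op in VOCAB]


def toy_compare(a, b):
    """Stand-in pairwise scorer: Jaccard overlap of opcode bigrams.
    Not CEBin's comparison model."""
    ga, gb = set(zip(a, a[1:])), set(zip(b, b[1:]))
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, pool_vecs, k):
    """Stage 1: shortlist the k pool indices closest to the query embedding."""
    order = sorted(range(len(pool_vecs)),
                   key=lambda i: cosine(query_vec, pool_vecs[i]),
                   reverse=True)
    return order[:k]


def two_stage_search(query_fn, pool_fns, embed, compare, k):
    """Stage 1 narrows the pool; stage 2 reranks only the shortlist pairwise."""
    # In a real system the pool embeddings would be precomputed and indexed.
    pool_vecs = [embed(f) for f in pool_fns]
    shortlist = retrieve_top_k(embed(query_fn), pool_vecs, k)
    return sorted(shortlist,
                  key=lambda i: compare(query_fn, pool_fns[i]),
                  reverse=True)


# Toy "functions" as opcode sequences (made-up data).
pool = [
    ["mov", "add", "cmp", "jmp", "ret"],          # 0: identical to the query
    ["add", "mov", "jmp", "cmp", "ret"],          # 1: same opcodes, reordered
    ["call", "call", "sub", "ret"],               # 2: unrelated
    ["mov", "add", "cmp", "jmp", "call", "ret"],  # 3: query plus one call
]
query = ["mov", "add", "cmp", "jmp", "ret"]

print(two_stage_search(query, pool, toy_embed, toy_compare, k=3))  # [0, 3, 1]
```

In this toy run the order-insensitive embedding cannot tell candidates 0 and 1 apart, while the pairwise scorer, which sees both sequences together, pushes the reordered function below candidate 3. The pairwise scorer is only called on the k shortlisted candidates, so its cost does not grow with the size of the pool.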
Related papers
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB.
Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets.
The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z)
- CARE: Confidence-rich Autonomous Robot Exploration using Bayesian Kernel Inference and Optimization [12.32946442160165]
We consider improving the efficiency of information-based autonomous robot exploration in unknown and complex environments.
We propose a novel lightweight information gain inference method based on Bayesian kernel inference and optimization (BKIO).
We show the desired efficiency of our proposed methods without losing exploration performance in different unstructured, cluttered environments.
arXiv Detail & Related papers (2023-09-11T02:30:06Z)
- Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting.
arXiv Detail & Related papers (2023-09-05T12:38:55Z)
- A Comprehensively Improved Hybrid Algorithm for Learning Bayesian Networks: Multiple Compound Memory Erasing [0.0]
This paper presents a new hybrid algorithm, MCME (multiple compound memory erasing).
MCME retains the advantages of the first two methods, solves the shortcomings of the above CI tests, and makes innovations in the scoring function in the direction discrimination stage.
A large number of experiments show that MCME performs better than or similarly to some existing algorithms.
arXiv Detail & Related papers (2022-12-05T12:52:07Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
- Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method for solving large-scale multimedia retrieval problems.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop $\beta$-CROWN, a bound propagation based verifier that can fully encode per-neuron splits.
$\beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification.
By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z)
- Bayesian Optimization with Machine Learning Algorithms Towards Anomaly Detection [66.05992706105224]
In this paper, an effective anomaly detection framework is proposed utilizing the Bayesian Optimization technique.
The performance of the considered algorithms is evaluated using the ISCX 2012 dataset.
Experimental results show the effectiveness of the proposed framework in terms of accuracy, precision, low false alarm rate, and recall.
arXiv Detail & Related papers (2020-08-05T19:29:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.