Related papers: CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection

CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection

URL: http://arxiv.org/abs/2402.18818v1
Date: Thu, 29 Feb 2024 03:02:07 GMT
Title: CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection
Authors: Hao Wang, Zeyu Gao, Chao Zhang, Mingyang Sun, Yuchen Zhou, Han Qiu, Xi Xiao
Abstract summary: Binary code similarity detection (BCSD) is a fundamental technique for various application. We propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches.
Score: 23.8834126695488
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Binary code similarity detection (BCSD) is a fundamental technique for various application. Many BCSD solutions have been proposed recently, which mostly are embedding-based, but have shown limited accuracy and efficiency especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overheads. Specifically, CEBin utilizes a refined embedding-based approach to extract features of target code, which efficiently narrows down the scope of candidate similar code and boosts performance. Then, it utilizes a comparison-based approach that performs a pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin is able to provide an effective and efficient solution for detecting similar code (including vulnerable ones) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in real world, we construct a large-scale benchmark of vulnerability, offering the first precise evaluation scheme to assess BCSD methods for the 1-day vulnerability detection task. CEBin could identify the similar function from millions of candidate functions in just a few seconds and achieves an impressive recall rate of $85.46\%$ on this more practical but challenging task, which are several order of magnitudes faster and $4.07\times$ better than the best SOTA baseline. Our code is available at https://github.com/Hustcw/CEBin.

Related papers

Anytime Cooperative Implicit Hitting Set Solving [46.010796136659536]
The Implicit Hitting Set (HS) approach has shown to be very effective for MaxSAT, Pseudo-boolean optimization and other frameworks. We show how it can be easily combined in a multithread architecture where cores discovered by either component are available. We show that the resulting algorithm (HS-lub) is consistently superior to either HS-lb and HS-ub in isolation.
arXiv Detail & Related papers (2025-01-14T07:23:52Z)
Efficient Approximate Degenerate Ordered Statistics Decoding for Quantum Codes via Reliable Subset Reduction [5.625796693054094]
We introduce the concept of approximate degenerate decoding and integrate it with ordered statistics decoding (OSD) We present an ADOSD algorithm that significantly improves OSD efficiency in the code capacity noise model.
arXiv Detail & Related papers (2024-12-30T17:45:08Z)
Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification. In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction. Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB. Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets. The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z)
Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching. An anchor branch is first trained to provide insights into the data properties. A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
CARE: Confidence-rich Autonomous Robot Exploration using Bayesian Kernel Inference and Optimization [12.32946442160165]
We consider improving the efficiency of information-based autonomous robot exploration in unknown and complex environments. We propose a novel lightweight information gain inference method based on Bayesian kernel inference and optimization (BKIO) We show the desired efficiency of our proposed methods without losing exploration performance in different unstructured, cluttered environments.
arXiv Detail & Related papers (2023-09-11T02:30:06Z)
Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale. It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search. This paper details the approach and an empirical assessment towards configuring and evaluating that approach in industrial setting.
arXiv Detail & Related papers (2023-09-05T12:38:55Z)
A Comprehensively Improved Hybrid Algorithm for Learning Bayesian Networks: Multiple Compound Memory Erasing [0.0]
This paper presents a new hybrid algorithm, MCME (multiple compound memory erasing) MCME retains the advantages of the first two methods, solves the shortcomings of the above CI tests, and makes innovations in the scoring function in the direction discrimination stage. A large number of experiments show that MCME has better or similar performance than some existing algorithms.
arXiv Detail & Related papers (2022-12-05T12:52:07Z)
UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions. In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method to solve large-scale multimedia retrieval issue. We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues. Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z)
Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples. We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment. We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop $beta$-CROWN, a bound propagation based verifier that can fully encode per-neuron splits. $beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification. By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z)
Bayesian Optimization with Machine Learning Algorithms Towards Anomaly Detection [66.05992706105224]
In this paper, an effective anomaly detection framework is proposed utilizing Bayesian Optimization technique. The performance of the considered algorithms is evaluated using the ISCX 2012 dataset. Experimental results show the effectiveness of the proposed framework in term of accuracy rate, precision, low-false alarm rate, and recall.
arXiv Detail & Related papers (2020-08-05T19:29:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.