SimCLF: A Simple Contrastive Learning Framework for Function-level
Binary Embeddings
- URL: http://arxiv.org/abs/2209.02442v2
- Date: Tue, 26 Dec 2023 17:11:43 GMT
- Title: SimCLF: A Simple Contrastive Learning Framework for Function-level
Binary Embeddings
- Authors: Sun RuiJin, Guo Shize, Guo Jinhong, Li Wei, Zhan Dazhi, Sun Meng, Pan
Zhisong
- Abstract summary: We propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings.
We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination.
SimCLF directly operates on disassembled binary functions and can be implemented with any encoder.
- Score: 2.1222884030559315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Function-level binary code similarity detection is a crucial aspect of
cybersecurity. It enables the detection of bugs and patent infringements in
released software and plays a pivotal role in preventing supply chain attacks.
A practical embedding learning framework relies on the robustness of the
assembly code representation and the accuracy of function-pair annotation,
which is traditionally accomplished using supervised learning-based frameworks.
However, annotating different function pairs with accurate labels poses
considerable challenges. These supervised learning methods can be easily
overtrained and suffer from representation robustness problems. To address
these challenges, we propose SimCLF: A Simple Contrastive Learning Framework
for Function-level Binary Embeddings. We take an unsupervised learning approach
and formulate binary code similarity detection as instance discrimination.
SimCLF directly operates on disassembled binary functions and can be
implemented with any encoder. It does not require manually annotated
information but only augmented data. Augmented data is generated using compiler
optimization options and code obfuscation techniques. The experimental results
demonstrate that SimCLF surpasses the state-of-the-art in accuracy and has a
significant advantage in few-shot settings.
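The instance-discrimination formulation described in the abstract is the objective used by SimCLR-style contrastive frameworks: embeddings of two augmented views of the same function (e.g. the same source compiled at different optimization levels) are pulled together, while all other functions in the batch act as negatives. A minimal NumPy sketch of this NT-Xent (InfoNCE) loss, assuming `z1[i]` and `z2[i]` are paired view embeddings; the function name and shapes are illustrative, not SimCLF's actual API:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent (InfoNCE) loss over a batch of paired embeddings.

    z1[i] and z2[i] embed two augmented views of the same function;
    every other row in the batch serves as a negative example.
    """
    z = np.concatenate([z1, z2], axis=0)                 # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit-normalize
    sim = z @ z.T / temperature                          # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    n = z1.shape[0]
    # the positive for row i is its paired view in the other half
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # numerically stable log-softmax over each row
    sim_max = sim.max(axis=1, keepdims=True)
    logits = sim - sim_max
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()
```

Identical view pairs yield a near-zero loss, while unrelated pairs are penalized, which is what drives the encoder to produce augmentation-invariant function embeddings.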
Related papers
- Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning.
One-sided feedback--where only positive examples are observed during training--is inherent in many practical problems.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.
We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
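The retrieval step described here (and the one-to-many search scenario mentioned for other BCSD models in this list) is typically a cosine-similarity nearest-neighbor search over function embeddings. A generic NumPy sketch of that mechanism, not FoC-Sim's actual implementation; the function name `top_k_similar` is an illustrative assumption:

```python
import numpy as np

def top_k_similar(query, database, k=5):
    """Rank database embeddings by cosine similarity to the query
    embedding and return the indices of the k closest matches."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                       # cosine similarity per row
    return np.argsort(-scores)[:k]        # indices, most similar first
```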
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision [22.42846252594693]
We present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code.
At its core, our approach achieves superior transfer learning by effectively aligning binary code with its semantic explanations.
We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP.
arXiv Detail & Related papers (2024-02-26T13:49:52Z)
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
- Adversarial Training with Complementary Labels: On the Benefit of Gradually Informative Attacks [119.38992029332883]
Adversarial training with imperfect supervision is significant but receives limited attention.
We propose a new learning strategy using gradually informative attacks.
Experiments are conducted to demonstrate the effectiveness of our method on a range of benchmarked datasets.
arXiv Detail & Related papers (2022-11-01T04:26:45Z)
- Statement-Level Vulnerability Detection: Learning Vulnerability Patterns Through Information Theory and Contrastive Learning [31.15123852246431]
We propose a novel end-to-end deep learning-based approach to identify the vulnerability-relevant code statements of a specific function.
Inspired by the structures observed in real-world vulnerable code, we first leverage mutual information for learning a set of latent variables.
We then propose novel clustered spatial contrastive learning in order to further improve the representation learning.
arXiv Detail & Related papers (2022-09-20T00:46:20Z)
- FuncFooler: A Practical Black-box Attack Against Learning-based Binary Code Similarity Detection Methods [13.694322857909166]
This paper designs an efficient and black-box adversarial code generation algorithm, namely, FuncFooler.
FuncFooler constrains the adversarial code to keep the program's control flow graph (CFG) unchanged and to preserve its semantics.
Empirically, our FuncFooler can successfully attack the three learning-based BCSD models, including SAFE, Asm2Vec, and jTrans.
arXiv Detail & Related papers (2022-08-26T01:58:26Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised video object segmentation is the task of segmenting the target object in a video sequence given only a mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework that exploits their complementarity for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.