SimCLF: A Simple Contrastive Learning Framework for Function-level
Binary Embeddings
- URL: http://arxiv.org/abs/2209.02442v2
- Date: Tue, 26 Dec 2023 17:11:43 GMT
- Title: SimCLF: A Simple Contrastive Learning Framework for Function-level
Binary Embeddings
- Authors: Sun RuiJin, Guo Shize, Guo Jinhong, Li Wei, Zhan Dazhi, Sun Meng, Pan
Zhisong
- Abstract summary: We propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings.
We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination.
SimCLF directly operates on disassembled binary functions and can be implemented with any encoder.
- Score: 2.1222884030559315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Function-level binary code similarity detection is a crucial aspect of
cybersecurity. It enables the detection of bugs and patent infringements in
released software and plays a pivotal role in preventing supply chain attacks.
A practical embedding learning framework relies on the robustness of the
assembly code representation and the accuracy of function-pair annotation,
which is traditionally accomplished using supervised learning-based frameworks.
However, annotating different function pairs with accurate labels poses
considerable challenges. Moreover, these supervised learning methods tend to
overfit and suffer from poor representation robustness. To address
these challenges, we propose SimCLF: A Simple Contrastive Learning Framework
for Function-level Binary Embeddings. We take an unsupervised learning approach
and formulate binary code similarity detection as instance discrimination.
SimCLF directly operates on disassembled binary functions and can be
implemented with any encoder. It does not require manually annotated
information but only augmented data. Augmented data is generated using compiler
optimization options and code obfuscation techniques. The experimental results
demonstrate that SimCLF surpasses the state-of-the-art in accuracy and has a
significant advantage in few-shot settings.
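The instance-discrimination formulation above is commonly trained with an InfoNCE-style contrastive loss over paired embeddings of augmented views of the same function (e.g., the same source compiled at different optimization levels). The following NumPy sketch is a generic illustration of such a loss, not the authors' implementation; the function name, temperature value, and toy embeddings are placeholders:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss for a batch of paired embeddings.

    z1[i] and z2[i] are embeddings of two augmented views (e.g. different
    compiler optimization options) of the same binary function; all other
    pairs in the batch act as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Row i's positive sits on the diagonal; apply softmax cross-entropy
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: a 4-function batch with 8-dimensional embeddings
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
loss_random = info_nce_loss(z1, rng.normal(size=(4, 8)))
loss_aligned = info_nce_loss(z1, z1)  # identical views score far better
assert loss_aligned < loss_random
```

The temperature controls how sharply the loss concentrates on hard negatives; in practice the encoder output, batch size, and temperature would all be tuned to the binary-embedding task.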
Related papers
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score.
FoC-Sim outperforms the previous best methods with a 52% higher Recall@1.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision [22.42846252594693]
We present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code.
At its core, our approach achieves superior transfer learning by effectively aligning binary code with its semantic explanations.
We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP.
arXiv Detail & Related papers (2024-02-26T13:49:52Z)
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Adversarial Training with Complementary Labels: On the Benefit of Gradually Informative Attacks [119.38992029332883]
Adversarial training with imperfect supervision is significant but receives limited attention.
We propose a new learning strategy using gradually informative attacks.
Experiments are conducted to demonstrate the effectiveness of our method on a range of benchmarked datasets.
arXiv Detail & Related papers (2022-11-01T04:26:45Z)
- The Devil is in the Details: On Models and Training Regimes for Few-Shot Intent Classification [81.60168035505039]
Few-shot Intent Classification (FSIC) is one of the key challenges in modular task-oriented dialog systems.
We show that a cross-encoder architecture and episodic meta-learning consistently yield the best FSIC performance.
Our findings pave the way for conducting state-of-the-art research in FSIC.
arXiv Detail & Related papers (2022-10-12T17:37:54Z)
- Statement-Level Vulnerability Detection: Learning Vulnerability Patterns Through Information Theory and Contrastive Learning [31.15123852246431]
We propose a novel end-to-end deep learning-based approach to identify the vulnerability-relevant code statements of a specific function.
Inspired by the structures observed in real-world vulnerable code, we first leverage mutual information for learning a set of latent variables.
We then propose novel clustered spatial contrastive learning in order to further improve the representation learning.
arXiv Detail & Related papers (2022-09-20T00:46:20Z)
- FuncFooler: A Practical Black-box Attack Against Learning-based Binary Code Similarity Detection Methods [13.694322857909166]
This paper designs an efficient and black-box adversarial code generation algorithm, namely, FuncFooler.
FuncFooler constrains the adversarial code to leave the program's control flow graph (CFG) unchanged and to preserve the same semantic meaning.
Empirically, our FuncFooler can successfully attack the three learning-based BCSD models, including SAFE, Asm2Vec, and jTrans.
arXiv Detail & Related papers (2022-08-26T01:58:26Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised object segmentation is a task of segmenting the target object in a video sequence given only a mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework that exploits their complementarity for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.