Pluvio: Assembly Clone Search for Out-of-domain Architectures and
Libraries through Transfer Learning and Conditional Variational Information
Bottleneck
- URL: http://arxiv.org/abs/2307.10631v1
- Date: Thu, 20 Jul 2023 06:55:37 GMT
- Authors: Zhiwei Fu, Steven H. H. Ding, Furkan Alaca, Benjamin C. M. Fung,
Philippe Charland
- Abstract summary: Assembly clone search has been effective in identifying vulnerable code resulting from reuse in released executables.
Recent studies on assembly clone search demonstrate a trend towards using machine learning-based methods to match assembly code variants.
We propose incorporating human common knowledge through large-scale pre-trained natural language models, in the form of transfer learning, into current learning-based approaches for assembly clone search.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The practice of code reuse is crucial in software development for a faster
and more efficient development lifecycle. In reality, however, code reuse
practices lack proper control, resulting in issues such as vulnerability
propagation and intellectual property infringements. Assembly clone search, a
critical shift-right defence mechanism, has been effective in identifying
vulnerable code resulting from reuse in released executables. Recent studies on
assembly clone search demonstrate a trend towards using machine learning-based
methods to match assembly code variants produced by different toolchains.
However, these methods are limited to what they learn from a small number of
toolchain variants used in training, rendering them inapplicable to unseen
architectures and their corresponding compilation toolchain variants.
This paper presents the first study on the problem of assembly clone search
with unseen architectures and libraries. We propose incorporating human common
knowledge through large-scale pre-trained natural language models, in the form
of transfer learning, into current learning-based approaches for assembly clone
search. Transfer learning can aid in addressing the limitations of the existing
approaches, as it can bring in broader knowledge from human experts in assembly
code. We further address the sequence limit issue by proposing a reinforcement
learning agent to remove unnecessary and redundant tokens. Coupled with a new
Variational Information Bottleneck learning strategy, the proposed system
minimizes the reliance on potential indicators of architectures and
optimization settings, for better generalization to unseen architectures. We
simulate the unseen architecture clone search scenarios and the experimental
results show the effectiveness of the proposed approach against the
state-of-the-art solutions.
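The abstract names a Variational Information Bottleneck (VIB) learning strategy but gives no formula. As a rough illustration only, the sketch below shows the standard, unconditional VIB regularizer for a diagonal Gaussian encoder: a task loss plus a weighted KL divergence to a standard normal prior. It is not the paper's conditional variant, and the names `kl_to_standard_normal`, `sample_latent`, `vib_loss`, and the weight `beta` are assumptions introduced here, not taken from the source.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def vib_loss(task_loss, mu, log_var, beta=1e-3):
    """Generic VIB objective: task loss plus beta times the KL regularizer.

    beta is a hypothetical trade-off weight; larger values compress the
    latent code more aggressively.
    """
    return task_loss + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2], [0.0, -1.0]
    z = sample_latent(mu, log_var, rng)
    print("latent sample:", z)
    print("loss:", vib_loss(task_loss=1.0, mu=mu, log_var=log_var))
```

In this generic form, the KL term penalizes information the latent code keeps about the input. The abstract's stated goal of reducing reliance on architecture and optimization indicators would need the conditional formulation described in the paper itself.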
Related papers
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models [12.959392500354223]
We pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks.
We introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models.
arXiv Detail & Related papers (2024-06-18T06:52:14Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned code in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
- Towards Understanding the Capability of Large Language Models on Code
Clone Detection: A Survey [40.99060616674878]
Large language models (LLMs) possess diverse code-related knowledge, making them versatile for various software engineering challenges.
This paper provides the first comprehensive evaluation of LLMs for clone detection, covering different clone types, languages, and prompts.
We find advanced LLMs excel in detecting complex semantic clones, surpassing existing methods.
arXiv Detail & Related papers (2023-08-02T14:56:01Z)
- Bayesian Program Learning by Decompiling Amortized Knowledge [50.960612835957875]
We present a novel approach for library learning that directly leverages the neural search policy, effectively "decompiling" its amortized knowledge to extract relevant program components.
This provides stronger amortized inference: the amortized knowledge learnt to reduce search breadth is now also used to reduce search depth.
arXiv Detail & Related papers (2023-06-13T15:35:01Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- SimCLF: A Simple Contrastive Learning Framework for Function-level
Binary Embeddings [2.1222884030559315]
We propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings.
We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination.
SimCLF directly operates on disassembled binary functions and could be implemented with any encoder.
arXiv Detail & Related papers (2022-09-06T12:09:45Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- LibFewShot: A Comprehensive Library for Few-shot Learning [78.58842209282724]
Few-shot learning, especially few-shot image classification, has received increasing attention and witnessed significant advances in recent years.
Some recent studies implicitly show that many generic techniques or tricks, such as data augmentation, pre-training, knowledge distillation, and self-supervision, may greatly boost the performance of a few-shot learning method.
We propose a comprehensive library for few-shot learning (LibFewShot) by re-implementing seventeen state-of-the-art few-shot learning methods in a unified framework with a single shared codebase in PyTorch.
arXiv Detail & Related papers (2021-09-10T14:12:37Z)