Pluvio: Assembly Clone Search for Out-of-domain Architectures and
Libraries through Transfer Learning and Conditional Variational Information
Bottleneck
- URL: http://arxiv.org/abs/2307.10631v1
- Date: Thu, 20 Jul 2023 06:55:37 GMT
- Authors: Zhiwei Fu, Steven H. H. Ding, Furkan Alaca, Benjamin C. M. Fung,
Philippe Charland
- Abstract summary: Assembly clone search has been effective in identifying vulnerable code resulting from reuse in released executables.
Recent studies on assembly clone search demonstrate a trend towards using machine learning-based methods to match assembly code variants.
We propose incorporating human common knowledge through large-scale pre-trained natural language models, in the form of transfer learning, into current learning-based approaches for assembly clone search.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The practice of code reuse is crucial in software development for a faster
and more efficient development lifecycle. In reality, however, code reuse
practices lack proper control, resulting in issues such as vulnerability
propagation and intellectual property infringements. Assembly clone search, a
critical shift-right defence mechanism, has been effective in identifying
vulnerable code resulting from reuse in released executables. Recent studies on
assembly clone search demonstrate a trend towards using machine learning-based
methods to match assembly code variants produced by different toolchains.
However, these methods are limited to what they learn from a small number of
toolchain variants used in training, rendering them inapplicable to unseen
architectures and their corresponding compilation toolchain variants.
This paper presents the first study on the problem of assembly clone search
with unseen architectures and libraries. We propose incorporating human common
knowledge through large-scale pre-trained natural language models, in the form
of transfer learning, into current learning-based approaches for assembly clone
search. Transfer learning can aid in addressing the limitations of the existing
approaches, as it can bring in broader knowledge from human experts in assembly
code. We further address the sequence-length limit issue by proposing a
reinforcement learning agent to remove unnecessary and redundant tokens.
Coupled with a new Variational Information Bottleneck learning strategy, the
proposed system minimizes the reliance on potential indicators of
architectures and optimization settings, enabling better generalization to
unseen architectures. We simulate unseen-architecture clone search scenarios,
and the experimental results show the effectiveness of the proposed approach
against state-of-the-art solutions.
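The bottleneck idea described above can be sketched with the closed-form KL term of a diagonal-Gaussian encoder. This is an illustrative simplification, not the paper's implementation: the function names, the beta weight, and the toy values are all assumptions.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian,
    summed over latent dimensions (closed form)."""
    return sum(
        0.5 * (math.exp(lv) + m * m - 1.0 - lv)
        for m, lv in zip(mu, log_var)
    )

def vib_objective(task_loss, mu, log_var, beta=1e-3):
    """Information-bottleneck-style objective: fit the task while keeping
    the latent code close to an uninformative prior, which discourages the
    encoder from memorising nuisance signals such as architecture- or
    optimization-specific tokens."""
    return task_loss + beta * gaussian_kl(mu, log_var)

# Toy check: a latent already at the prior (mu=0, log_var=0) adds no penalty.
assert gaussian_kl([0.0, 0.0], [0.0, 0.0]) == 0.0
loss = vib_objective(task_loss=0.7, mu=[0.5, -0.2], log_var=[0.1, -0.3])
```

Increasing `beta` tightens the bottleneck, trading some task accuracy for invariance to nuisance signals.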
Related papers
- ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models [49.04652315815501]
Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools.
We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task.
arXiv Detail & Related papers (2025-02-17T03:42:28Z)
- Idioms: Neural Decompilation With Joint Code and Type Prediction [7.421408987075001]
We introduce a new training process to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined types alongside the decompilation.
Motivated by the intuition that different parts of data structures can be operated upon by different parts of the program, we show that interprocedural context can help improve neural decompilers' ability to handle user-defined types.
arXiv Detail & Related papers (2025-02-06T22:13:40Z)
- Multimodal Instruction Disassembly with Covariate Shift Adaptation and Real-time Implementation [3.70729078195191]
We introduce a new miniature platform, RASCv3, that can simultaneously collect power and EM measurements from a target device.
We devise a new approach to combine and select features from power and EM traces using information theory.
The recognition rates of offline and real-time instruction disassemblers are compared for single- and multi-modal cases.
arXiv Detail & Related papers (2024-12-10T17:00:23Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models [12.959392500354223]
We pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks.
We introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models.
arXiv Detail & Related papers (2024-06-18T06:52:14Z)
- Bayesian Program Learning by Decompiling Amortized Knowledge [50.960612835957875]
We present a novel approach for library learning that directly leverages the neural search policy, effectively "decompiling" its amortized knowledge to extract relevant program components.
This provides stronger amortized inference: the amortized knowledge learnt to reduce search breadth is now also used to reduce search depth.
arXiv Detail & Related papers (2023-06-13T15:35:01Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- LibFewShot: A Comprehensive Library for Few-shot Learning [78.58842209282724]
Few-shot learning, especially few-shot image classification, has received increasing attention and witnessed significant advances in recent years.
Some recent studies implicitly show that many generic techniques or tricks, such as data augmentation, pre-training, knowledge distillation, and self-supervision, may greatly boost the performance of a few-shot learning method.
We propose a comprehensive library for few-shot learning (LibFewShot) by re-implementing seventeen state-of-the-art few-shot learning methods in a unified framework with the same single codebase in PyTorch.
arXiv Detail & Related papers (2021-09-10T14:12:37Z)
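Several of the papers above treat clone search as nearest-neighbour retrieval over learned embeddings (e.g. the one-to-many search scenario evaluated for IRBinDiff). As a rough illustration only, with made-up vectors and function names that come from none of these systems, ranking candidates by cosine similarity looks like:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_clones(query, repository):
    # One-to-many search: order repository functions by similarity
    # to the query function's embedding, best match first.
    return sorted(repository, key=lambda item: cosine(query, item[1]),
                  reverse=True)

# Hypothetical embeddings for illustration.
repo = [("memcpy_arm", [0.9, 0.1]), ("strlen_mips", [0.1, 0.9])]
best = rank_clones([1.0, 0.0], repo)[0][0]  # closest is "memcpy_arm"
```

The real systems differ in how the embeddings are produced (graph contrastive learning, transfer learning, etc.); the retrieval step itself is this simple.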
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.