BinBert: Binary Code Understanding with a Fine-tunable and
Execution-aware Transformer
- URL: http://arxiv.org/abs/2208.06692v1
- Date: Sat, 13 Aug 2022 17:48:52 GMT
- Title: BinBert: Binary Code Understanding with a Fine-tunable and
Execution-aware Transformer
- Authors: Fiorella Artuso, Marco Mormando, Giuseppe A. Di Luna, Leonardo
Querzoni
- Abstract summary: In this paper we present BinBert, a novel assembly code model.
BinBert is built on a transformer pre-trained on a huge dataset of both assembly instruction sequences and symbolic execution information.
Through fine-tuning, BinBert learns how to apply the general knowledge acquired with pre-training to the specific task.
- Score: 2.8523943706562638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent trend in binary code analysis promotes the use of neural solutions
based on instruction embedding models. An instruction embedding model is a
neural network that transforms sequences of assembly instructions into
embedding vectors. If the embedding network is trained such that the
translation from code to vectors partially preserves the semantics, the network
effectively represents an assembly code model.
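To make this concrete, here is a minimal, self-contained sketch of such an instruction embedding network: a small transformer encoder that maps a tokenized assembly instruction sequence to one embedding vector. The class name, dimensions, tokenization and mean pooling are illustrative assumptions, not BinBert's actual architecture.
```python
# Minimal sketch (NOT BinBert's actual architecture): a transformer encoder
# that maps a tokenized sequence of assembly instructions to one vector.
# Vocabulary size, model width and the pooling choice are illustrative.
import torch
import torch.nn as nn

class InstructionEmbedder(nn.Module):
    def __init__(self, vocab_size=5000, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)        # token embeddings
        self.pos = nn.Embedding(512, dim)               # learned positions
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):                       # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        h = self.encoder(x)                             # (batch, seq_len, dim)
        return h.mean(dim=1)                            # one vector per sequence

# Usage: token ids would come from a tokenizer over assembly text such as
# "mov eax, [rbp-0x8]"; random ids are used here only to check shapes.
model = InstructionEmbedder()
fake_batch = torch.randint(0, 5000, (2, 32))
print(model(fake_batch).shape)  # torch.Size([2, 256])
```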
In this paper we present BinBert, a novel assembly code model. BinBert is
built on a transformer pre-trained on a huge dataset of both assembly
instruction sequences and symbolic execution information. BinBert can be
applied to assembly instruction sequences and is fine-tunable, i.e., it can be
re-trained as part of a neural architecture on task-specific data. Through
fine-tuning, BinBert learns how to apply the general knowledge acquired during
pre-training to the specific task.
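As a rough illustration of what fine-tuning means here, the sketch below wraps a pre-trained encoder (the InstructionEmbedder from the previous sketch stands in for BinBert) with a task-specific head and updates all weights, encoder included, on labeled task data. The task, head and hyperparameters are assumptions chosen for brevity.
```python
# Hedged sketch of fine-tuning: stack a task head on a pre-trained encoder and
# re-train the whole stack on task-specific data (nothing is frozen).
# `InstructionEmbedder` refers to the illustrative encoder sketched above;
# the binary-classification task and hyperparameters are assumptions.
import torch
import torch.nn as nn

class FineTunedClassifier(nn.Module):
    def __init__(self, pretrained_encoder, embed_dim=256, n_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder           # pre-trained weights, not frozen
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids))   # task logits per sequence

model = FineTunedClassifier(InstructionEmbedder())
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR, BERT-style
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on (synthetic) task-specific data.
token_ids = torch.randint(0, 5000, (8, 32))
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = loss_fn(model(token_ids), labels)
loss.backward()
optimizer.step()
```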
We evaluated BinBert on a multi-task benchmark that we specifically designed
to test the understanding of assembly code. The benchmark is composed of
several tasks, some taken from the literature and a few novel ones of our own
design, with a mix of intrinsic and downstream tasks.
Our results show that BinBert outperforms state-of-the-art models for binary
instruction embedding, raising the bar for binary code understanding.
Related papers
- BinSym: Binary-Level Symbolic Execution using Formal Descriptions of Instruction Semantics [2.4576576560952788]
BinSym is a framework for symbolic program analysis of software in binary form.
It operates directly on binary code instructions and does not require lifting them to an intermediate representation.
arXiv Detail & Related papers (2024-04-05T14:29:39Z)
- CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision [22.42846252594693]
We present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code.
At its core, our approach achieves superior transfer learning capabilities by effectively aligning binary code with its semantic explanations (a generic sketch of this kind of contrastive code-text alignment appears after this list).
We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP.
arXiv Detail & Related papers (2024-02-26T13:49:52Z)
- CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS.
CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
arXiv Detail & Related papers (2023-10-24T14:20:39Z)
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a soft-labeled contrastive pre-training framework with two positive sample construction methods.
Considering the relevance between codes in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft-labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform the AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovering the contextual meaning of binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic, which utilizes BERT to produce a semantic-aware code representation of binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
- PalmTree: Learning an Assembly Language Model for Instruction Embedding [8.74990895782223]
We propose to pre-train an assembly language model called PalmTree for generating general-purpose instruction embeddings.
PalmTree has the best performance on intrinsic metrics and outperforms the other instruction embedding schemes on all downstream tasks.
arXiv Detail & Related papers (2021-01-21T22:30:01Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
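Several of the papers above (CLAP, SCodeR, CodeRetriever) build on contrastive pre-training that aligns code with natural language. The sketch below shows a generic symmetric InfoNCE-style loss over a batch of paired code and text embeddings; it is a common formulation written under our own assumptions, not the exact objective of any of the listed papers.
```python
# Generic contrastive (InfoNCE-style) alignment of code and text embeddings.
# Illustrative only: the encoders are assumed to be given, and this is not
# the precise loss used by CLAP, SCodeR or CodeRetriever.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(code_emb, text_emb, temperature=0.07):
    """code_emb, text_emb: (batch, dim) embeddings of paired code and text."""
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(code_emb.size(0), device=code_emb.device)
    # Matching pairs sit on the diagonal; treat both directions symmetrically.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```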