kTrans: Knowledge-Aware Transformer for Binary Code Embedding
- URL: http://arxiv.org/abs/2308.12659v1
- Date: Thu, 24 Aug 2023 09:07:11 GMT
- Title: kTrans: Knowledge-Aware Transformer for Binary Code Embedding
- Authors: Wenyu Zhu, Hao Wang, Yuchen Zhou, Jiaming Wang, Zihan Sha, Zeyu Gao,
Chao Zhang
- Abstract summary: We propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding.
We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR), and Indirect Call Recognition (ICR).
Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively.
- Score: 15.361622199889263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binary Code Embedding (BCE) has important applications in various reverse
engineering tasks such as binary code similarity detection, type recovery,
control-flow recovery and data-flow analysis. Recent studies have shown that
the Transformer model can comprehend the semantics of binary code to support
downstream tasks. However, existing models overlook the prior knowledge of
assembly language. In this paper, we propose a novel Transformer-based
approach, namely kTrans, to generate knowledge-aware binary code embedding. By
feeding explicit knowledge as additional inputs to the Transformer, and fusing
implicit knowledge with a novel pre-training task, kTrans provides a new
perspective on incorporating domain knowledge into a Transformer framework. We
inspect the generated embeddings with outlier detection and visualization, and
also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection
(BCSD), Function Type Recovery (FTR) and Indirect Call Recognition (ICR).
Evaluation results show that kTrans can generate high-quality binary code
embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream
tasks by 5.2%, 6.8%, and 12.6% respectively. kTrans is publicly available at:
https://github.com/Learner0x5a/kTrans-release
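To make the core idea concrete, below is a minimal sketch (not the released kTrans code) of how explicit knowledge could be fed as additional inputs to a Transformer encoder, with the pooled output reused for a downstream task such as BCSD. The knowledge fields (opcode type, operand type, eflags), all dimensions, and the pooling choice are illustrative assumptions; refer to the linked repository for the actual model and pre-training task.

```python
# Hedged sketch, not the authors' implementation: one way to sum explicit-knowledge
# embeddings into a Transformer's input, analogous to segment embeddings in BERT.
import torch
import torch.nn as nn

class KnowledgeAwareEncoder(nn.Module):
    def __init__(self, vocab_size=10000, n_opcode_types=256, n_operand_types=64,
                 n_eflag_states=16, d_model=768, n_heads=12, n_layers=12, max_len=512):
        super().__init__()
        # Standard token/position embeddings for the assembly token stream.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Extra embeddings carrying explicit knowledge about each token
        # (illustrative fields; real knowledge sources may differ).
        self.opcode_emb = nn.Embedding(n_opcode_types, d_model)
        self.operand_emb = nn.Embedding(n_operand_types, d_model)
        self.eflags_emb = nn.Embedding(n_eflag_states, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens, opcode_ids, operand_ids, eflag_ids):
        # All inputs: (batch, seq_len) int64 tensors, seq_len <= max_len.
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = (self.tok_emb(tokens) + self.pos_emb(pos)
             + self.opcode_emb(opcode_ids)
             + self.operand_emb(operand_ids)
             + self.eflags_emb(eflag_ids))
        h = self.encoder(x)          # (batch, seq_len, d_model)
        return h.mean(dim=1)         # mean-pooled function embedding

# Example downstream use for BCSD: cosine similarity between function embeddings.
# model = KnowledgeAwareEncoder()
# sim = torch.nn.functional.cosine_similarity(model(tok_a, op_a, opd_a, fl_a),
#                                             model(tok_b, op_b, opd_b, fl_b))
```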
Related papers
- CrossMPT: Cross-attention Message-Passing Transformer for Error Correcting Codes [14.631435001491514]
We propose a novel Cross-attention Message-Passing Transformer (CrossMPT).
We show that CrossMPT significantly outperforms existing neural network-based decoders for various code classes.
Notably, CrossMPT achieves this decoding performance improvement, while significantly reducing the memory usage, complexity, inference time, and training time.
arXiv Detail & Related papers (2024-05-02T06:30:52Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills [31.75121546422898]
We present TransCoder, a unified transferable fine-tuning strategy for code representation learning.
We employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge.
Our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement.
arXiv Detail & Related papers (2023-05-23T06:59:22Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition [73.7566539108205]
We observe the great potential of transformation recognition (RecogTrans) on both semantic-related and temporal-related downstream tasks.
Based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training.
To mitigate this problem, we develop TransRank, a unified framework for recognizing Transformations in a Ranking formulation.
arXiv Detail & Related papers (2022-05-04T12:39:25Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- Error Correction Code Transformer [92.10654749898927]
We propose the first extension of the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel output into a high-dimensional representation so that the bit information can be processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z)
- TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation [49.794142076551026]
The Transformer-based Knowledge Distillation (TransKD) framework learns compact student transformers by distilling both the feature maps and patch embeddings of large teacher transformers.
Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks.
arXiv Detail & Related papers (2022-02-27T16:34:10Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- Relevance Transformer: Generating Concise Code Snippets with Relevance Feedback [6.230751621285322]
We introduce and study modern Transformer architectures for explicit code generation.
We propose a new model called the Relevance Transformer that incorporates external knowledge using pseudo-relevance feedback.
The results show improvements over state-of-the-art methods based on BLEU evaluation.
arXiv Detail & Related papers (2020-07-06T09:54:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.