kTrans: Knowledge-Aware Transformer for Binary Code Embedding
- URL: http://arxiv.org/abs/2308.12659v1
- Date: Thu, 24 Aug 2023 09:07:11 GMT
- Title: kTrans: Knowledge-Aware Transformer for Binary Code Embedding
- Authors: Wenyu Zhu, Hao Wang, Yuchen Zhou, Jiaming Wang, Zihan Sha, Zeyu Gao,
Chao Zhang
- Abstract summary: We propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding.
We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR), and Indirect Call Recognition (ICR).
Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively.
- Score: 15.361622199889263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binary Code Embedding (BCE) has important applications in various reverse
engineering tasks such as binary code similarity detection, type recovery,
control-flow recovery and data-flow analysis. Recent studies have shown that
the Transformer model can comprehend the semantics of binary code to support
downstream tasks. However, existing models overlook the prior knowledge of
assembly language. In this paper, we propose a novel Transformer-based
approach, namely kTrans, to generate knowledge-aware binary code embedding. By
feeding explicit knowledge as additional inputs to the Transformer, and fusing
implicit knowledge with a novel pre-training task, kTrans provides a new
perspective on incorporating domain knowledge into a Transformer framework. We
inspect the generated embeddings with outlier detection and visualization, and
also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection
(BCSD), Function Type Recovery (FTR) and Indirect Call Recognition (ICR).
Evaluation results show that kTrans can generate high-quality binary code
embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream
tasks by 5.2%, 6.8%, and 12.6% respectively. kTrans is publicly available at:
https://github.com/Learner0x5a/kTrans-release
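To make the core idea concrete, below is a minimal sketch (not the released kTrans code) of how explicit knowledge could be fed as additional inputs to a Transformer encoder, with the pooled output reused for a downstream task such as BCSD. The knowledge fields (opcode type, operand type, eflags), all dimensions, and the pooling choice are illustrative assumptions; refer to the linked repository for the actual model and pre-training task.

```python
# Hedged sketch, not the authors' implementation: one way to sum explicit-knowledge
# embeddings into a Transformer's input, analogous to segment embeddings in BERT.
import torch
import torch.nn as nn

class KnowledgeAwareEncoder(nn.Module):
    def __init__(self, vocab_size=10000, n_opcode_types=256, n_operand_types=64,
                 n_eflag_states=16, d_model=768, n_heads=12, n_layers=12, max_len=512):
        super().__init__()
        # Standard token/position embeddings for the assembly token stream.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Extra embeddings carrying explicit knowledge about each token
        # (illustrative fields; real knowledge sources may differ).
        self.opcode_emb = nn.Embedding(n_opcode_types, d_model)
        self.operand_emb = nn.Embedding(n_operand_types, d_model)
        self.eflags_emb = nn.Embedding(n_eflag_states, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens, opcode_ids, operand_ids, eflag_ids):
        # All inputs: (batch, seq_len) int64 tensors, seq_len <= max_len.
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = (self.tok_emb(tokens) + self.pos_emb(pos)
             + self.opcode_emb(opcode_ids)
             + self.operand_emb(operand_ids)
             + self.eflags_emb(eflag_ids))
        h = self.encoder(x)          # (batch, seq_len, d_model)
        return h.mean(dim=1)         # mean-pooled function embedding

# Example downstream use for BCSD: cosine similarity between function embeddings.
# model = KnowledgeAwareEncoder()
# sim = torch.nn.functional.cosine_similarity(model(tok_a, op_a, opd_a, fl_a),
#                                             model(tok_b, op_b, opd_b, fl_b))
```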
Related papers
- CrossMPT: Cross-attention Message-Passing Transformer for Error Correcting Codes [14.631435001491514]
We propose a novel Cross-attention Message-Passing Transformer (CrossMPT).
We show that CrossMPT significantly outperforms existing neural network-based decoders for various code classes.
Notably, CrossMPT achieves this decoding performance improvement, while significantly reducing the memory usage, complexity, inference time, and training time.
arXiv Detail & Related papers (2024-05-02T06:30:52Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills [31.75121546422898]
We present TransCoder, a unified transferable fine-tuning strategy for code representation learning.
We employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge.
Our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement.
arXiv Detail & Related papers (2023-05-23T06:59:22Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition [73.7566539108205]
We observe the great potential of transformation recognition (RecogTrans) on both semantic-related and temporal-related downstream tasks.
Based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training.
To mitigate this problem, we develop TransRank, a unified framework for recognizing Transformations in a Ranking formulation.
arXiv Detail & Related papers (2022-05-04T12:39:25Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- Error Correction Code Transformer [92.10654749898927]
We propose the first extension of the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel output into a high-dimensional representation so that the bit information can be processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z)
- TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation [49.794142076551026]
The Transformer-based Knowledge Distillation (TransKD) framework learns compact student transformers by distilling both the feature maps and patch embeddings of large teacher transformers.
Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks.
arXiv Detail & Related papers (2022-02-27T16:34:10Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- Relevance Transformer: Generating Concise Code Snippets with Relevance Feedback [6.230751621285322]
We introduce and study modern Transformer architectures for explicit code generation.
We propose a new model called the Relevance Transformer that incorporates external knowledge using pseudo-relevance feedback.
The results show improvements over state-of-the-art methods based on BLEU evaluation.
arXiv Detail & Related papers (2020-07-06T09:54:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.