CP-BCS: Binary Code Summarization Guided by Control Flow Graph and
Pseudo Code
- URL: http://arxiv.org/abs/2310.16853v1
- Date: Tue, 24 Oct 2023 14:20:39 GMT
- Title: CP-BCS: Binary Code Summarization Guided by Control Flow Graph and
Pseudo Code
- Authors: Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu,
Shouling Ji, Wenhai Wang
- Abstract summary: We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS.
CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
- Score: 79.87518649544405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically generating function summaries for binaries is an extremely
valuable but challenging task, since it involves translating the execution
behavior and semantics of the low-level language (assembly code) into
human-readable natural language. However, most current works on understanding
assembly code are oriented towards generating function names, which involve
numerous abbreviations that make them still confusing. To bridge this gap, we
focus on generating complete summaries for binary functions, especially for
stripped binary (no symbol table and debug information in reality). To fully
exploit the semantics of assembly code, we present a control flow graph and
pseudo code guided binary code summarization framework called CP-BCS. CP-BCS
utilizes a bidirectional instruction-level control flow graph and pseudo code
that incorporates expert knowledge to learn the comprehensive binary function
execution behavior and logic semantics. We evaluate CP-BCS on 3 different
binary optimization levels (O1, O2, and O3) for 3 different computer
architectures (X86, X64, and ARM). The evaluation results demonstrate CP-BCS is
superior and significantly improves the efficiency of reverse engineering.
Related papers
- Geometric structure and transversal logic of quantum Reed-Muller codes [51.11215560140181]
In this paper, we aim to characterize the gates of quantum Reed-Muller (RM) codes by exploiting the well-studied properties of their classical counterparts.
A set of stabilizer generators for a RM code can be described via $X$ and $Z$ operators acting on subcubes of particular dimensions.
arXiv Detail & Related papers (2024-10-10T04:07:24Z) - How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z) - FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.
We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z) - CLAP: Learning Transferable Binary Code Representations with Natural
Language Supervision [22.42846252594693]
We present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code.
At the core, our approach boosts superior transfer learning capabilities by effectively aligning binary code with their semantics explanations.
We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP.
arXiv Detail & Related papers (2024-02-26T13:49:52Z) - Pre-Training Representations of Binary Code Using Contrastive Learning [13.570375923483452]
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning.
arXiv Detail & Related papers (2022-10-11T02:39:06Z) - Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z) - PalmTree: Learning an Assembly Language Model for Instruction Embedding [8.74990895782223]
We propose to pre-train an assembly language model called PalmTree for generating general-purpose instruction embeddings.
PalmTree has the best performance for intrinsic metrics, and outperforms the other instruction embedding schemes for all downstream tasks.
arXiv Detail & Related papers (2021-01-21T22:30:01Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z) - Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.