Related papers: BinGo: Identifying Security Patches in Binary Code with Graph Representation Learning

BinGo: Identifying Security Patches in Binary Code with Graph Representation Learning

URL: http://arxiv.org/abs/2312.07921v1
Date: Wed, 13 Dec 2023 06:35:39 GMT
Title: BinGo: Identifying Security Patches in Binary Code with Graph Representation Learning
Authors: Xu He, Shu Wang, Pengbin Feng, Xinda Wang, Shiyu Sun, Qi Li, Kun Sun
Abstract summary: We propose BinGo, a new security patch detection system for binary code. BinGo consists of four phases, namely, patch data pre-processing, graph extraction, embedding generation, and graph representation learning. Our experimental results show BinGo can achieve up to 80.77% accuracy in identifying security patches between two neighboring versions of binary code.
Score: 19.22004583230725
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A timely software update is vital to combat the increasing security vulnerabilities. However, some software vendors may secretly patch their vulnerabilities without creating CVE entries or even describing the security issue in their change log. Thus, it is critical to identify these hidden security patches and defeat potential N-day attacks. Researchers have employed various machine learning techniques to identify security patches in open-source software, leveraging the syntax and semantic features of the software changes and commit messages. However, all these solutions cannot be directly applied to the binary code, whose instructions and program flow may dramatically vary due to different compilation configurations. In this paper, we propose BinGo, a new security patch detection system for binary code. The main idea is to present the binary code as code property graphs to enable a comprehensive understanding of program flow and perform a language model over each basic block of binary code to catch the instruction semantics. BinGo consists of four phases, namely, patch data pre-processing, graph extraction, embedding generation, and graph representation learning. Due to the lack of an existing binary security patch dataset, we construct such a dataset by compiling the pre-patch and post-patch source code of the Linux kernel. Our experimental results show BinGo can achieve up to 80.77% accuracy in identifying security patches between two neighboring versions of binary code. Moreover, BinGo can effectively reduce the false positives and false negatives caused by the different compilers and optimization levels.

Related papers

An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z)
ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
Can Neural Decompilation Assist Vulnerability Prediction on Binary Code? [0.0]
This paper presents an experimental study to predict vulnerabilities in binary code without source code or complex representations of the binary. The results outperform the state-of-the-art in both neural decompilation and vulnerability prediction.
arXiv Detail & Related papers (2024-12-10T14:17:14Z)
Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries [2.696054049278301]
We introduce DeBinVul, a novel decompiled binary code vulnerability dataset. We fine-tune state-of-the-art LLMs using DeBinVul and report on a performance increase of 19%, 24%, and 21% in detecting binary code vulnerabilities.
arXiv Detail & Related papers (2024-11-07T18:54:31Z)
HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation. Recent studies highlight that many LLM-generated code contains serious security vulnerabilities. We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS. CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
arXiv Detail & Related papers (2023-10-24T14:20:39Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
NeuDep: Neural Binary Memory Dependence Analysis [28.33030658966508]
We present a new machine-learning-based approach to predict memory dependencies by exploiting the model's learned knowledge about how binary programs execute. We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled by 2 compilers, 4 optimizations, and 4 obfuscation passes.
arXiv Detail & Related papers (2022-10-04T04:59:36Z)
SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings [2.1222884030559315]
We propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings. We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination. SimCLF directly operates on disassembled binary functions and could be implemented with any encoder.
arXiv Detail & Related papers (2022-09-06T12:09:45Z)
Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary. In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph. It is updated by decoding in the context of an auto-encoder. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
Bin2vec: Learning Representations of Binary Executable Programs for Security Tasks [15.780176500971244]
We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach.
arXiv Detail & Related papers (2020-02-09T15:46:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.