BinGo: Identifying Security Patches in Binary Code with Graph
Representation Learning
- URL: http://arxiv.org/abs/2312.07921v1
- Date: Wed, 13 Dec 2023 06:35:39 GMT
- Title: BinGo: Identifying Security Patches in Binary Code with Graph
Representation Learning
- Authors: Xu He, Shu Wang, Pengbin Feng, Xinda Wang, Shiyu Sun, Qi Li, Kun Sun
- Abstract summary: We propose BinGo, a new security patch detection system for binary code.
BinGo consists of four phases, namely, patch data pre-processing, graph extraction, embedding generation, and graph representation learning.
Our experimental results show BinGo can achieve up to 80.77% accuracy in identifying security patches between two neighboring versions of binary code.
- Score: 19.22004583230725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A timely software update is vital to combat the increasing security
vulnerabilities. However, some software vendors may secretly patch their
vulnerabilities without creating CVE entries or even describing the security
issue in their change log. Thus, it is critical to identify these hidden
security patches and defeat potential N-day attacks. Researchers have employed
various machine learning techniques to identify security patches in open-source
software, leveraging the syntax and semantic features of the software changes
and commit messages. However, none of these solutions can be directly applied to
binary code, whose instructions and program flow may vary dramatically due
to different compilation configurations. In this paper, we propose BinGo, a new
security patch detection system for binary code. The main idea is to represent
the binary code as code property graphs to enable a comprehensive understanding
of program flow and to apply a language model to each basic block of binary
code to capture the instruction semantics. BinGo consists of four phases, namely,
patch data pre-processing, graph extraction, embedding generation, and graph
representation learning. Due to the lack of an existing binary security patch
dataset, we construct such a dataset by compiling the pre-patch and post-patch
source code of the Linux kernel. Our experimental results show BinGo can
achieve up to 80.77% accuracy in identifying security patches between two
neighboring versions of binary code. Moreover, BinGo can effectively reduce the
false positives and false negatives caused by different compilers and
optimization levels.
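The graph-plus-embedding idea in the abstract can be illustrated with a small sketch. This is not BinGo's implementation: the `PatchGraph` container, the hashed bag-of-tokens `embed_block` (a stand-in for the language model), and the single round of mean aggregation in `message_pass` (a stand-in for graph representation learning) are all assumptions chosen only to show how per-block embeddings and program-flow edges could combine into one graph-level vector.

```python
from dataclasses import dataclass, field
import hashlib

DIM = 8  # toy embedding width (assumption, not from the paper)


@dataclass
class PatchGraph:
    """Hypothetical container: basic blocks as nodes, flow edges between them."""
    blocks: dict[int, list[str]]  # node id -> disassembled instructions
    edges: list[tuple[int, int]] = field(default_factory=list)  # (src, dst)


def embed_block(instructions: list[str]) -> list[float]:
    """Stand-in for a language model: hashed bag of tokens, normalised to sum 1."""
    vec = [0.0] * DIM
    for ins in instructions:
        for tok in ins.replace(",", " ").split():
            h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
            vec[h % DIM] += 1.0
    total = sum(vec) or 1.0  # avoid division by zero for empty blocks
    return [v / total for v in vec]


def message_pass(graph: PatchGraph, feats: dict[int, list[float]]) -> dict[int, list[float]]:
    """One round of mean aggregation over each node's predecessors and itself."""
    out = {}
    for node in graph.blocks:
        neigh = [feats[s] for s, d in graph.edges if d == node] + [feats[node]]
        out[node] = [sum(col) / len(neigh) for col in zip(*neigh)]
    return out


def graph_embedding(graph: PatchGraph) -> list[float]:
    """Embed every block, propagate along edges once, then mean-pool all nodes."""
    feats = {n: embed_block(ins) for n, ins in graph.blocks.items()}
    feats = message_pass(graph, feats)
    return [sum(col) / len(feats) for col in zip(*feats.values())]


if __name__ == "__main__":
    # Toy pre-patch and post-patch graphs; the instructions are made up.
    pre = PatchGraph({0: ["mov eax, 1"], 1: ["ret"]}, [(0, 1)])
    post = PatchGraph(
        {0: ["mov eax, 1"], 1: ["cmp eax, ebx", "jne 0x10"], 2: ["ret"]},
        [(0, 1), (1, 2)],
    )
    print(graph_embedding(pre))
    print(graph_embedding(post))
```

In a real system the two graph-level vectors (or a joint graph of the change) would feed a trained classifier; here they are only printed, since the source does not describe that stage in enough detail to sketch it.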
Related papers
- Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries [2.696054049278301]
We introduce DeBinVul, a novel decompiled binary code vulnerability dataset.
We fine-tune state-of-the-art LLMs using DeBinVul and report on a performance increase of 19%, 24%, and 21% in detecting binary code vulnerabilities.
arXiv Detail & Related papers (2024-11-07T18:54:31Z)
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.
We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS.
CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
arXiv Detail & Related papers (2023-10-24T14:20:39Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- NeuDep: Neural Binary Memory Dependence Analysis [28.33030658966508]
We present a new machine-learning-based approach to predict memory dependencies by exploiting the model's learned knowledge about how binary programs execute.
We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled by 2 compilers, 4 optimizations, and 4 obfuscation passes.
arXiv Detail & Related papers (2022-10-04T04:59:36Z)
- SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings [2.1222884030559315]
We propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings.
We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination.
SimCLF directly operates on disassembled binary functions and could be implemented with any encoder.
arXiv Detail & Related papers (2022-09-06T12:09:45Z)
- Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection, require recovering the contextual meaning of binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic, which uses BERT to produce a semantic-aware code representation of binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
- Bin2vec: Learning Representations of Binary Executable Programs for Security Tasks [15.780176500971244]
We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs.
We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks.
We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach.
arXiv Detail & Related papers (2020-02-09T15:46:43Z)