Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries
- URL: http://arxiv.org/abs/2301.01701v1
- Date: Wed, 4 Jan 2023 16:56:33 GMT
- Title: Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries
- Authors: Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, Prem Devanbu, Arie van Deursen
- Abstract summary: We extend large pre-trained language models of source code to summarise decompiled binary functions.
We investigate the impact of input and data properties on the performance of such models.
BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reverse engineering binaries is required to understand and analyse programs
for which the source code is unavailable. Decompilers can transform the largely
unreadable binaries into a more readable source code-like representation.
However, reverse engineering is time-consuming, and much of that time is
spent labelling functions with semantic information.
While the automated summarisation of decompiled code can help reverse
engineers understand and analyse binaries, current work mainly focuses on
summarising source code, and no suitable dataset exists for this task.
In this work, we extend large pre-trained language models of source code to
summarise decompiled binary functions. Furthermore, we investigate the impact
of input and data properties on the performance of such models. Our approach
consists of two main components: the data and the model.
We first build CAPYBARA, a dataset of 214K decompiled function-documentation
pairs across various compiler optimisations. We extend CAPYBARA further by
generating synthetic datasets and deduplicating the data.
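
The deduplication procedure itself is not spelled out in this summary; a minimal sketch of exact-match deduplication over function-documentation pairs, assuming whitespace normalisation before hashing (all names are illustrative, and near-duplicate detection would need more, e.g. MinHash):

```python
import hashlib

def dedup_pairs(pairs):
    """Drop exact duplicates from (decompiled_function, documentation) pairs.

    Whitespace is normalised before hashing so trivially reformatted copies
    of the same function count as duplicates. Exact-match only.
    """
    seen, unique = set(), []
    for code, doc in pairs:
        key = hashlib.sha256(" ".join(code.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((code, doc))
    return unique

# Hypothetical usage: the reformatted copy is dropped.
pairs = [("int f(int a){return a+1;}", "Increments a."),
         ("int  f(int a){ return a+1; }", "Increments a.")]
print(len(dedup_pairs(pairs)))  # 1
```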
Next, we fine-tune the CodeT5 base model on CAPYBARA to create BinT5. BinT5
achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for
summarising source, decompiled, and synthetically stripped decompiled code,
respectively. This indicates that these models can be successfully extended
to decompiled binaries.
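
As an illustration of what such fine-tuning can look like, here is a minimal sketch using the public Salesforce/codet5-base checkpoint from the HuggingFace hub; the example pair, sequence lengths, and training-loop details are assumptions, not the paper's exact recipe:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# CodeT5's base checkpoint ships with a RoBERTa-style tokenizer.
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# One hypothetical (decompiled function, documentation) pair.
code = 'undefined4 main(void) { puts("hi"); return 0; }'
doc = "Prints a greeting and returns 0."

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(doc, return_tensors="pt", truncation=True,
                   max_length=128).input_ids

# With labels supplied, the model returns the seq2seq cross-entropy loss.
loss = model(**inputs, labels=labels).loss
loss.backward()  # in practice, wrap this in an optimiser/Trainer loop
```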
Finally, we found that the performance of BinT5 is not heavily dependent on
the dataset size or the compiler optimisation level. We recommend that future
research further investigate knowledge transfer for less expressive input
formats such as stripped binaries.
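
For reference, BLEU-4 scores like those above measure 1- to 4-gram overlap between generated and reference summaries. A sketch using NLTK's smoothed sentence-level BLEU (whether the paper uses this exact smoothing variant is an assumption):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "prints a greeting and exits".split()
candidate = "prints a greeting then exits".split()

# BLEU-4: geometric mean of 1- to 4-gram precisions; smoothing keeps short
# sentences with missing higher-order n-grams from scoring exactly 0.
smooth = SmoothingFunction().method2
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {100 * score:.2f}")
```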
Related papers
- STRIDE: Simple Type Recognition In Decompiled Executables (arXiv, 2024-07-03)
We propose STRIDE, a technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data.
We evaluate it on three benchmark datasets and find that STRIDE achieves performance comparable to state-of-the-art machine learning models for both variable retyping and renaming; a sketch of the matching idea follows.
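
A rough, hedged illustration of that matching idea (not STRIDE's actual algorithm; the window size and voting scheme are assumptions): index fixed-length token n-grams around variable uses in training data, then predict by majority vote over matching contexts.

```python
from collections import Counter, defaultdict

def build_index(training_funcs, n=5):
    """Map each n-gram of decompiler tokens to variable names seen with it."""
    index = defaultdict(Counter)
    for tokens, var_name in training_funcs:  # tokens: context of a variable use
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])][var_name] += 1
    return index

def predict_name(index, tokens, n=5):
    """Vote over all training variables whose contexts share an n-gram."""
    votes = Counter()
    for i in range(len(tokens) - n + 1):
        votes.update(index.get(tuple(tokens[i:i + n]), {}))
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical usage: a query context sharing n-grams with training data.
idx = build_index([(["mov", "eax", "[", "rbp", "-", "4", "]"], "counter")], n=3)
print(predict_name(idx, ["mov", "eax", "[", "rbp", "-", "8", "]"], n=3))  # counter
```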
- How Far Have We Gone in Binary Code Understanding Using Large Language Models (arXiv, 2024-04-15)
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent and can thereby improve the efficiency of binary code analysis.
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs (arXiv, 2024-03-27)
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score.
FoC-Sim outperforms the previous best methods with a 52% higher Recall@1; this retrieval metric is sketched below.
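
Recall@1 itself is simple to state; a sketch assuming a ranked candidate list per query (not FoC's implementation):

```python
def recall_at_1(ranked_results, gold):
    """Fraction of queries whose top-ranked candidate is the gold match.

    ranked_results: {query_id: [candidate ids, best first]}
    gold: {query_id: correct candidate id}
    """
    hits = sum(1 for q, ranking in ranked_results.items()
               if ranking and ranking[0] == gold[q])
    return hits / len(ranked_results)
```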
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization (arXiv, 2024-01-26)
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM (arXiv, 2023-05-31)
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
- Revisiting Deep Learning for Variable Type Recovery (arXiv, 2023-04-07)
DIRTY is a Transformer-based encoder-decoder architecture capable of augmenting decompiled code with variable names and types.
We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler.
- Boosting Neural Networks to Decompile Optimized Binaries (arXiv, 2023-01-03)
Decompilation aims to transform a low-level programming language (LPL) into its functionally equivalent high-level programming language (HPL).
We propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries.
- Pre-Training Representations of Binary Code Using Contrastive Learning (arXiv, 2022-10-11)
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning; a generic sketch of such a contrastive objective follows.
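
COMBO's exact objective is not given here; a generic InfoNCE-style contrastive loss of the kind such representation models typically build on, sketched in PyTorch (the in-batch negatives and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def info_nce(binary_emb, source_emb, temperature=0.07):
    """Pull each binary function towards its own source/comment embedding
    and away from the other items in the batch."""
    b = F.normalize(binary_emb, dim=1)  # (N, d)
    s = F.normalize(source_emb, dim=1)  # (N, d)
    logits = b @ s.t() / temperature    # pairwise cosine similarities
    targets = torch.arange(b.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```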
- Improving type information inferred by decompilers with supervised machine learning (arXiv, 2021-01-19)
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system predicts function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure; a sketch of this kind of supervised setup follows.
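
A hedged sketch of such a supervised setup on synthetic stand-in features (the classifier choice and macro averaging are assumptions, not the paper's configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: feature vectors extracted from compiled functions,
# labelled with the high-level return type of each function.
rng = np.random.default_rng(0)
X = rng.random((1000, 32))
y = rng.choice(["int", "char*", "void"], size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```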
- A Transformer-based Approach for Source Code Summarization (arXiv, 2020-05-01)
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
- Auto-Encoding Twin-Bottleneck Hashing (arXiv, 2020-02-27)
This paper proposes an efficient and adaptive code-driven graph, updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.