SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
- URL: http://arxiv.org/abs/2305.12520v3
- Date: Thu, 15 Feb 2024 15:42:02 GMT
- Title: SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
- Authors: Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, Michael F.P. O'Boyle
- Abstract summary: This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code.
We utilize type-inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decompilation is a well-studied area with numerous high-quality tools
available. These are frequently used for security tasks and to port legacy
code. However, they regularly generate difficult-to-read programs and require a
large amount of engineering effort to support new programming languages and
ISAs. Recent interest in neural approaches has produced portable tools that
generate readable code. However, to date such techniques have been restricted to
unoptimized synthetic programs, and their portability has not been evaluated.
Furthermore, while the generated code may be more readable, it is
usually incorrect. This paper presents SLaDe, a Small Language model Decompiler
based on a sequence-to-sequence transformer trained over real-world code. We
develop a novel tokenizer and exploit no-dropout training to produce
high-quality code. We utilize type-inference to generate programs that are more
readable and accurate than standard analytic and recent neural approaches.
Unlike standard approaches, SLaDe can infer out-of-context types, and unlike
neural approaches, it generates correct code. We evaluate SLaDe on over 4,000
functions from ExeBench on two ISAs and at two optimization levels. SLaDe is up
to 6 times more accurate than Ghidra, a state-of-the-art, industrial-strength
decompiler, and up to 4 times more accurate than the large language model
ChatGPT, and it generates significantly more readable code than both.
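The abstract credits a novel tokenizer for part of SLaDe's quality, but does not describe it. As a purely illustrative sketch (the real SLaDe tokenizer is not specified here), splitting optimized assembly into model-friendly tokens might look like:

```python
import re

def tokenize_asm(asm: str) -> list[str]:
    """Split assembly text into tokens a seq2seq model could consume.

    Hypothetical sketch: separates mnemonics, registers, literals, and
    punctuation so each becomes its own token, with explicit instruction
    boundaries. SLaDe's actual tokenizer may differ substantially.
    """
    tokens = []
    for line in asm.strip().splitlines():
        # Identifiers/mnemonics, signed integer literals, and punctuation
        # each match as separate tokens.
        tokens.extend(re.findall(r"[A-Za-z_.][\w.]*|-?\d+|[\[\],:+*-]", line))
        tokens.append("<eol>")  # mark the end of each instruction
    return tokens

print(tokenize_asm("mov eax, [rbp-4]\nadd eax, 1"))
```

A whitespace split would glue operands to commas and brackets; keeping punctuation as standalone tokens gives the model a cleaner, smaller vocabulary over real-world assembly.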
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
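CodeGRAG's exact graph construction is not described in this summary. As a rough, hypothetical illustration of deriving a graphical view from code structure, a crude statement-level edge list can be extracted with Python's standard `ast` module:

```python
import ast

def statement_edges(source: str) -> list[tuple[str, str]]:
    """Illustrative sketch only: record edges between consecutive
    statements in each block as a minimal control-flow-style view.
    CodeGRAG's real graphs (full control and data flow) are richer.
    """
    edges = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        body = getattr(node, "body", None)
        if isinstance(body, list):
            # Link each statement to its successor within the block.
            for a, b in zip(body, body[1:]):
                edges.append((type(a).__name__, type(b).__name__))
    return edges

print(statement_edges("x = 1\nif x:\n    y = 2\n    z = 3\nw = y"))
```

Even this toy edge list exposes structure (which statements follow which, and under what nesting) that a flat token sequence hides from the model.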
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Can Large Language Models Write Parallel Code? [0.5317767988097261]
Large language models are increasingly becoming a popular tool for software development.
In this paper, we study the capabilities of state-of-the-art language models to generate parallel code.
arXiv Detail & Related papers (2024-01-23T08:25:12Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
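The interface between the neural "guess" and the symbolic "sketch" stage is not detailed in this summary. A minimal, hypothetical sketch of the idea, keeping high-confidence tokens and marking low-confidence ones as holes for a solver:

```python
def split_by_confidence(tokens, probs, threshold=0.9):
    """Hypothetical sketch: keep tokens the LM is confident about and
    replace uncertain ones with holes for a downstream symbolic stage.
    Guess & Sketch's actual features and solver interface may differ.
    """
    keep, holes = [], []
    for i, (tok, p) in enumerate(zip(tokens, probs)):
        if p >= threshold:
            keep.append(tok)
        else:
            keep.append("<hole>")  # solver fills this position later
            holes.append(i)
    return keep, holes

out, holes = split_by_confidence(["mov", "r0", "#1"], [0.99, 0.5, 0.95])
print(out, holes)
```

The design point is the division of labor: the LM proposes plausible output cheaply, and the expensive correctness guarantee is spent only on the positions it is unsure about.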
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- COMEX: A Tool for Generating Customized Source Code Representations [7.151800146054561]
COMEX is a framework that allows researchers and developers to create and combine multiple code-views.
It can analyze both method-level and program-level snippets using both intra-procedural and inter-procedural analysis.
It is built on tree-sitter, a widely used incremental parsing tool that supports over 40 languages.
arXiv Detail & Related papers (2023-07-10T16:46:34Z)
- Planning with Large Language Models for Code Generation [100.07232672883897]
Planning-Guided Transformer Decoding (PG-TD) uses a planning algorithm to do lookahead search and guide the Transformer to generate better programs.
We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks.
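PG-TD's planner performs lookahead search to steer decoding; the details are not given in this summary. As a toy, one-step illustration of the principle (the actual planner is a richer tree-search procedure), a decoder can pick each token by the best score achievable one step later rather than by the immediate score:

```python
def lookahead_decode(score, vocab, steps=3):
    """Toy sketch of planning-guided decoding: choose each token by the
    best one-step continuation under `score` (a callable rating whole
    sequences), instead of greedily. PG-TD's real lookahead is deeper.
    """
    seq = []
    for _ in range(steps):
        best_tok, best_val = None, float("-inf")
        for t in vocab:
            # Evaluate t by the best score reachable one token later.
            val = max(score(seq + [t, u]) for u in vocab)
            if val > best_val:
                best_tok, best_val = t, val
        seq.append(best_tok)
    return seq

# Toy score rewarding alternation between adjacent tokens.
score = lambda s: sum(x != y for x, y in zip(s, s[1:]))
print(lookahead_decode(score, ["a", "b"]))
```

With real programs, `score` would be something like the fraction of tests a partial program can still pass, which is exactly where lookahead beats greedy token-by-token decoding.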
arXiv Detail & Related papers (2023-03-09T18:59:47Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
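The ranker's architecture is not described here, but the way a ranker lifts pass@1 is simple to illustrate: score every sampled program with the learned model and return the top-scoring one. A minimal sketch with a stand-in scorer:

```python
def rerank_pass_at_1(samples, ranker):
    """Sketch of fault-aware reranking: order sampled programs by a
    predicted-correctness score (here any callable) so the top-1 pick
    is more likely to pass, without executing any candidate.
    """
    return max(samples, key=ranker)

# Stand-in ranker for illustration only: pretend shorter programs
# score as more likely correct. A real ranker is a trained model.
best = rerank_pass_at_1(["x=1", "x = 1", "x=one"], lambda s: -len(s))
print(best)
```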
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
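Pass rates like the ~15% quoted above are commonly reported via the unbiased pass@k estimator; whether APPS uses exactly this formulation is an assumption, but it is the standard one for sampled code generation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    programs drawn (without replacement) from n generated samples, of
    which c pass all tests, is correct. Standard formulation; its use
    by APPS specifically is an assumption here.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing program
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples, 3 of which pass, evaluated at k=1
print(pass_at_k(10, 3, 1))
```

Computing it combinatorially over all n samples, rather than literally drawing k of them, removes the variance a naive k-sample estimate would have.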
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.