Idioms: Neural Decompilation With Joint Code and Type Prediction
- URL: http://arxiv.org/abs/2502.04536v1
- Date: Thu, 06 Feb 2025 22:13:40 GMT
- Title: Idioms: Neural Decompilation With Joint Code and Type Prediction
- Authors: Luke Dramko, Claire Le Goues, Edward J. Schwartz
- Abstract summary: We introduce a new training process to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined types alongside the decompilation.
Motivated by the intuition that different parts of data structures can be operated upon by different parts of the program, we show that interprocedural context can help improve neural decompilers' ability to handle user-defined types.
- Score: 7.421408987075001
- Abstract: Decompilers are important tools for reverse engineers that help them analyze software at a higher level of abstraction than assembly. Unfortunately, because compilation is lossy, deterministic decompilers produce code that is missing many of the details that make source code readable in the first place, like variable names and types. Neural decompilers, on the other hand, offer the ability to statistically fill in these details. Existing work in neural decompilation, however, suffers from substantial drawbacks that limit its ability to handle real code: it is unable to handle user-defined composite types, which are essential to fully specifying many functions' semantics, or it requires test cases. In this work, we introduce a new training process to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined types alongside the decompilation. We introduce a new dataset, Realtype, that includes substantially more complicated and realistic types than existing neural decompilation benchmarks. Motivated by the intuition that different parts of data structures can be operated upon by different parts of the program, we show that interprocedural context can help improve neural decompilers' ability to handle user-defined types. We show that our training process yields state-of-the-art results in neural decompilation. We also publicly release the Idioms series of finetuned neural decompilation models in support of open science. In summary, we identify the need for joint code and type prediction, show that it is a hard problem, and take the first steps towards solving it.
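To make the abstract's point about user-defined types concrete, here is a minimal sketch in C. It is not taken from the paper, its models, or the Realtype dataset: `struct node`, `sum_list`, `sub_401000`, and `check_equivalence` are hypothetical names chosen for illustration. The first function resembles source code with its type definition; the second imitates the untyped style a deterministic decompiler tends to produce, where fields become raw pointer offsets. A real decompiler would usually print hard-coded byte offsets; `offsetof` is used here only to keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-defined type, invented for this illustration. */
struct node {
    int32_t value;
    struct node *next;
};

/* Source-like form: with the struct definition available,
 * each field access documents itself. */
int32_t sum_list(const struct node *head) {
    int32_t total = 0;
    for (const struct node *cur = head; cur != NULL; cur = cur->next) {
        total += cur->value;
    }
    return total;
}

/* Imitation of deterministic decompiler output: same behavior, but the
 * struct is gone and fields are reached through byte offsets from a raw
 * pointer. The names a1 and v1 mimic auto-generated identifiers. */
int32_t sub_401000(const char *a1) {
    int32_t v1 = 0;
    while (a1 != NULL) {
        v1 += *(const int32_t *)(a1 + offsetof(struct node, value));
        a1 = (const char *)*(struct node *const *)(a1 + offsetof(struct node, next));
    }
    return v1;
}

/* Both forms compute the same result on a small three-element list. */
void check_equivalence(void) {
    struct node c = {3, NULL};
    struct node b = {2, &c};
    struct node a = {1, &b};
    assert(sum_list(&a) == 6);
    assert(sub_401000((const char *)&a) == 6);
}
```

Recovering the first form from a binary requires predicting both the function body and the definition of `struct node`, which is the joint code and type prediction task the abstract describes.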
Related papers
- ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages.
This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
- Fast, Fine-Grained Equivalence Checking for Neural Decompilers [7.421408987075001]
We introduce codealign, a novel instruction-level code equivalence technique designed for neural decompilers.
We show how codealign generates equivalence alignments, then evaluate codealign by comparing it with symbolic execution.
arXiv Detail & Related papers (2025-01-08T19:59:48Z)
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Neuro-Symbolic Execution of Generic Source Code [6.47243430672461]
We introduce Neural Interpretation (NI), the first neural model for the execution of generic source code that allows missing definitions.
NI is a novel neural model of computers with a compiler architecture that can assemble neural layers "programmed" by source code.
arXiv Detail & Related papers (2023-03-23T17:56:45Z)
- Boosting Neural Networks to Decompile Optimized Binaries [13.255618541522436]
Decompilation aims to transform a low-level program language (LPL) into its functionally-equivalent high-level program language (HPL).
We propose a novel learning-based approach named NeurDP, that targets compiler-optimized binaries.
arXiv Detail & Related papers (2023-01-03T06:45:54Z)
- Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
- Learning C to x86 Translation: An Experiment in Neural Compilation [3.997680012976965]
Code-to-code neural models have been used in code translation, code refinement and decompilation.
In this work, we explore neural compilation, building and evaluating Transformer models that learn how to produce x86 assembler from C code.
arXiv Detail & Related papers (2021-08-17T14:11:15Z)
- Representing Partial Programs with Blended Abstract Semantics [62.20775388513027]
We introduce a technique for representing partially written programs in a program synthesis engine.
We learn an approximate execution model implemented as a modular neural network.
We show that these hybrid neuro-symbolic representations enable execution-guided synthesizers to use more powerful language constructs.
arXiv Detail & Related papers (2020-12-23T20:40:18Z)
- Neurocoder: Learning General-Purpose Computation Using Stored Neural Programs [64.56890245622822]
Neurocoder is an entirely new class of general-purpose conditional computational machines.
It "codes" itself in a data-responsive way by composing relevant programs from a set of shareable, modular programs.
We show new capacity to learn modular programs, handle severe pattern shifts and remember old programs as new ones are learnt.
arXiv Detail & Related papers (2020-09-24T01:39:16Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.