Idioms: Neural Decompilation With Joint Code and Type Definition Prediction
- URL: http://arxiv.org/abs/2502.04536v2
- Date: Tue, 17 Jun 2025 16:39:31 GMT
- Title: Idioms: Neural Decompilation With Joint Code and Type Definition Prediction
- Authors: Luke Dramko, Claire Le Goues, Edward J. Schwartz
- Abstract summary: We introduce a new dataset, Realtype, that includes substantially more complicated and realistic types than existing neural decompilation benchmarks. We show that our approach yields state-of-the-art results in neural decompilation.
- Score: 7.421408987075001
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Decompilers are important tools for reverse engineers that help them analyze software at a higher level of abstraction than assembly code. Unfortunately, because compilation is lossy, deterministic decompilers produce code that is missing many of the details that make source code readable in the first place, like variable names and types. Neural decompilers, on the other hand, offer the ability to statistically fill in these details. Existing work in neural decompilation, however, suffers from substantial limitations that preclude its use on real code, such as the inability to define composite types, which is essential to fully specify function semantics. In this work, we introduce a new dataset, Realtype, that includes substantially more complicated and realistic types than existing neural decompilation benchmarks, and Idioms, a new neural decompilation approach to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined type definitions alongside the decompiled code. We show that our approach yields state-of-the-art results in neural decompilation. On the most challenging existing benchmark, ExeBench, our model achieves 54.4% accuracy vs. 46.3% for LLM4Decompile and 37.5% for Nova; on Realtype, our model performs at least 95% better.
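To make the distinction concrete, the sketch below contrasts typical deterministic decompiler output with the kind of output joint code and type definition prediction aims for. It is a hand-written illustration: the struct, field, and function names are invented and are not taken from the Idioms models or the Realtype dataset.

```c
/* Hand-written illustration (not actual Idioms output).
 * Deterministic decompilers typically emit generic integer types and raw
 * pointer arithmetic, leaving the composite type unspecified: */
long sub_401200(long a1)
{
    return *(long *)(a1 + 8) + *(int *)(a1 + 16);
}

/* Joint code-and-type prediction emits the user-defined type definition
 * alongside the decompiled function, fully specifying its semantics: */
struct order {
    char *customer;   /* offset 0  */
    long  subtotal;   /* offset 8  */
    int   tax;        /* offset 16 */
};

long order_total(struct order *o)
{
    return o->subtotal + o->tax;
}
```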
Related papers
- Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z) - ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages.
This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
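As a concrete, hand-written illustration of the task itself (not an example from the ReF Decompile paper):

```c
/* Hand-written illustration of the decompilation task (not from the paper).
 * Given x86-64 assembly such as (System V calling convention, Intel syntax):
 *
 *     square:
 *         mov  eax, edi
 *         imul eax, edi
 *         ret
 *
 * a decompiler aims to recover equivalent, readable source: */
int square(int x)
{
    return x * x;
}
```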
arXiv Detail & Related papers (2025-02-17T12:38:57Z) - Fast, Fine-Grained Equivalence Checking for Neural Decompilers [7.421408987075001]
We introduce codealign, a novel instruction-level code equivalence technique designed for neural decompilers. We show how codealign generates equivalence alignments, then evaluate codealign by comparing it with symbolic execution.
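The sketch below is our own illustration of the kind of correspondence an instruction-level equivalence alignment captures; it is not codealign's algorithm or output format.

```c
/* Our illustration, not codealign output: a reference function and a model
 * prediction that differ in names and loop form but are behaviorally
 * equivalent, so their operations can be aligned one-to-one. */
int sum_to(int n)                   /* reference */
{
    int total = 0;                  /* aligns with: acc = 0        */
    for (int i = 1; i <= n; i++)
        total += i;                 /* aligns with: acc += j; j++  */
    return total;                   /* aligns with: return acc     */
}

int sum_to_pred(int n)              /* prediction */
{
    int acc = 0, j = 1;
    while (j <= n) {
        acc += j;
        j++;
    }
    return acc;
}
```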
arXiv Detail & Related papers (2025-01-08T19:59:48Z) - A Library for Learning Neural Operators [75.14579433742178]
We present NeuralOperator, an open-source Python library for operator learning.
Neural operators generalize neural networks to maps between function spaces instead of finite-dimensional Euclidean spaces.
Built on top of PyTorch, NeuralOperator provides all the tools for training and deploying neural operator models.
arXiv Detail & Related papers (2024-12-13T18:49:37Z) - Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
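For clarity about what "recognizer" means here, the sketch below is a hand-written (non-neural) recognizer for a simple regular language; the trained networks are asked to make the same accept/reject decision on strings.

```c
#include <stdbool.h>

/* Hand-written illustration of the recognizer task (not from the paper):
 * accept strings over {a, b} that contain an even number of 'a's.
 * A neural recognizer is trained to produce the same binary decision. */
bool accepts(const char *s)
{
    int parity = 0;             /* DFA state: count of 'a' mod 2 */
    for (; *s; s++) {
        if (*s == 'a')
            parity ^= 1;
        else if (*s != 'b')
            return false;       /* symbol outside the alphabet */
    }
    return parity == 0;         /* accepts("abba") = true, accepts("ab") = false */
}
```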
arXiv Detail & Related papers (2024-11-11T16:33:25Z) - SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z) - Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z) - SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly [6.080751346188323]
This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code.
We utilize type-inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches.
arXiv Detail & Related papers (2023-05-21T17:31:39Z) - Neuro-Symbolic Execution of Generic Source Code [6.47243430672461]
We introduce Neural Interpretation (NI), the first neural model for the execution of generic source code that allows missing definitions.
NI is a novel neural model of computers with a compiler architecture that can assemble neural layers "programmed" by source code.
arXiv Detail & Related papers (2023-03-23T17:56:45Z) - Boosting Neural Networks to Decompile Optimized Binaries [13.255618541522436]
Decompilation aims to transform a low-level programming language (LPL) into its functionally-equivalent high-level programming language (HPL).
We propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries.
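As a hand-written illustration (not from the NeurDP paper) of why optimized binaries are harder to decompile, compilers routinely erase source-level structure:

```c
/* Hand-written illustration, not NeurDP output. Original source: */
int scale(int x)
{
    int y = 0;
    for (int i = 0; i < 8; i++)
        y += x;
    return y;
}

/* At -O2 a compiler typically strength-reduces the loop, so the binary
 * effectively contains: */
int scale_optimized(int x)
{
    return x << 3;
}
/* A decompiler only sees the second form; mapping it back to readable,
 * equivalent source is what a learned approach must handle. */
```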
arXiv Detail & Related papers (2023-01-03T06:45:54Z) - Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z) - Learning C to x86 Translation: An Experiment in Neural Compilation [3.997680012976965]
Code-to-code neural models have been used in code translation, code refinement and decompilation.
In this work, we explore neural compilation, building and evaluating Transformer models that learn how to produce x86 assembler from C code.
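A minimal hand-written example of the task (ours, not drawn from the paper's dataset):

```c
/* Hand-written illustration of the neural compilation task: the model reads
 * C like this and must emit equivalent x86-64 assembly, e.g. (System V
 * convention, Intel syntax):
 *
 *     add:
 *         lea  eax, [rdi + rsi]
 *         ret
 */
int add(int a, int b)
{
    return a + b;
}
```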
arXiv Detail & Related papers (2021-08-17T14:11:15Z) - Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
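A hand-written sketch of the prediction target (not an example from the paper):

```c
#include <stdbool.h>

/* Hand-written illustration of the prediction target (not from the paper).
 * A decompiler often falls back to a generic integer return type: */
long long sub_4011a0(long long a1);

/* The classifier instead predicts the high-level type the original function
 * actually returned, e.g. bool for: */
bool is_empty(const char *s)
{
    return s == NULL || s[0] == '\0';
}
```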
arXiv Detail & Related papers (2021-01-19T11:45:46Z) - Representing Partial Programs with Blended Abstract Semantics [62.20775388513027]
We introduce a technique for representing partially written programs in a program synthesis engine.
We learn an approximate execution model implemented as a modular neural network.
We show that these hybrid neuro-symbolic representations enable execution-guided synthesizers to use more powerful language constructs.
arXiv Detail & Related papers (2020-12-23T20:40:18Z) - Neurocoder: Learning General-Purpose Computation Using Stored Neural Programs [64.56890245622822]
Neurocoder is an entirely new class of general-purpose conditional computational machines.
It "codes" itself in a data-responsive way by composing relevant programs from a set of shareable, modular programs.
We show new capacity to learn modular programs, handle severe pattern shifts and remember old programs as new ones are learnt.
arXiv Detail & Related papers (2020-09-24T01:39:16Z) - Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL
primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
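As a hand-written sketch of the kind of data-reuse transformation polyhedral analysis is used to derive (this is not code generated by PolyDL):

```c
#define N 512
#define T 64    /* tile size; assumes T divides N */

/* Hand-written illustration, not PolyDL-generated code: tiled matrix
 * multiplication. Tiling the i/k/j loops keeps blocks of A, B, and C in
 * cache across the inner loops, the data reuse that polyhedral analysis
 * reasons about. Accumulates A*B into C (caller zero-initializes C). */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```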
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.