Related papers: Idioms: Neural Decompilation With Joint Code and Type Prediction

URL: http://arxiv.org/abs/2502.04536v1
Date: Thu, 06 Feb 2025 22:13:40 GMT
Title: Idioms: Neural Decompilation With Joint Code and Type Prediction
Authors: Luke Dramko, Claire Le Goues, Edward J. Schwartz
Abstract summary: We introduce a new training process to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined types alongside the decompilation. Motivated by the intuition that different parts of data structures can be operated upon by different parts of the program, we show that interprocedural context can help improve neural decompilers' ability to handle user-defined types.
Score: 7.421408987075001
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Decompilers are important tools for reverse engineers that help them analyze software at a higher level of abstraction than assembly. Unfortunately, because compilation is lossy, deterministic decompilers produce code that is missing many of the details that make source code readable in the first place, like variable names and types. Neural decompilers, on the other hand, offer the ability to statistically fill in these details. Existing work in neural decompilation, however, suffers from substantial drawbacks that limit its ability to handle real code: it either is unable to handle user-defined composite types, which are essential to fully specifying many functions' semantics, or requires test cases. In this work, we introduce a new training process to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined types alongside the decompilation. We introduce a new dataset, Realtype, that includes substantially more complicated and realistic types than existing neural decompilation benchmarks. Motivated by the intuition that different parts of data structures can be operated upon by different parts of the program, we show that interprocedural context can help improve neural decompilers' ability to handle user-defined types. We show that our training process yields state-of-the-art results in neural decompilation. We also publicly release the Idioms series of finetuned neural decompilation models in support of open science. In summary, we identify the need for joint code and type prediction, show that it is a hard problem, and take the first steps towards solving it.

Related papers

ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
Fast, Fine-Grained Equivalence Checking for Neural Decompilers [7.421408987075001]
We introduce codealign, a novel instruction-level code equivalence technique designed for neural decompilers. We show how codealign generates equivalence alignments, then evaluate codealign by comparing it with symbolic execution.
arXiv Detail & Related papers (2025-01-08T19:59:48Z)
A Library for Learning Neural Operators [75.14579433742178]
We present NeuralOperator, an open-source Python library for operator learning. Neural operators generalize neural networks to maps between function spaces instead of finite-dimensional Euclidean spaces. Built on top of PyTorch, NeuralOperator provides all the tools for training and deploying neural operator models.
arXiv Detail & Related papers (2024-12-13T18:49:37Z)
Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers. It is common to instead use proxy tasks that are similar in only an informal sense. We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
Neuro-Symbolic Execution of Generic Source Code [6.47243430672461]
We introduce Neural Interpretation (NI), the first neural model for the execution of generic source code that allows missing definitions. NI is a novel neural model of computers with a compiler architecture that can assemble neural layers "programmed" by source code.
arXiv Detail & Related papers (2023-03-23T17:56:45Z)
Boosting Neural Networks to Decompile Optimized Binaries [13.255618541522436]
Decompilation aims to transform a low-level program language (LPL) into its functionally-equivalent high-level program language (HPL). We propose a novel learning-based approach, named NeurDP, that targets compiler-optimized binaries.
arXiv Detail & Related papers (2023-01-03T06:45:54Z)
Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages. We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
Learning C to x86 Translation: An Experiment in Neural Compilation [3.997680012976965]
Code-to-code neural models have been used in code translation, code refinement and decompilation. In this work, we explore neural compilation, building and evaluating Transformer models that learn how to produce x86 assembler from C code.
arXiv Detail & Related papers (2021-08-17T14:11:15Z)
Representing Partial Programs with Blended Abstract Semantics [62.20775388513027]
We introduce a technique for representing partially written programs in a program synthesis engine. We learn an approximate execution model implemented as a modular neural network. We show that these hybrid neuro-symbolic representations enable execution-guided synthesizers to use more powerful language constructs.
arXiv Detail & Related papers (2020-12-23T20:40:18Z)
Neurocoder: Learning General-Purpose Computation Using Stored Neural Programs [64.56890245622822]
Neurocoder is an entirely new class of general-purpose conditional computational machines. It "codes" itself in a data-responsive way by composing relevant programs from a set of shareable, modular programs. We show new capacity to learn modular programs, handle severe pattern shifts and remember old programs as new ones are learnt.
arXiv Detail & Related papers (2020-09-24T01:39:16Z)
PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives. We develop novel data reuse analysis algorithms using the polyhedral model. We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.