Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement
- URL: http://arxiv.org/abs/2406.17233v2
- Date: Thu, 03 Oct 2024 08:43:25 GMT
- Title: Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement
- Authors: Yunlong Feng, Dechuan Teng, Yang Xu, Honglin Mu, Xiao Xu, Libo Qin, Qingfu Zhu, Wanxiang Che,
- Abstract summary: Decompilation transforms compiled code back into a high-level programming language when source code is unavailable.
Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training.
By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 3.90% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 52.41%.
- Score: 43.2637367483626
- License:
- Abstract: Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc$^2$dec) method recompiles the LLM's decompilation results to construct pairs for in-context learning, helping the model improve decompilation performance. (2) Fine-grained Alignment Enhancement (FAE), which meticulously aligns assembly code with source code at the statement level by leveraging debugging information, is employed during the fine-tuning phase to achieve further improvements in decompilation. By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 3.90% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 52.41%. The code, data, and models are available at https://github.com/AlongWY/sccdec.
Related papers
- CompilerDream: Learning a Compiler World Model for General Code Optimization [58.87557583347996]
We introduce CompilerDream, a model-based reinforcement learning approach to general code optimization.
It comprises a compiler world model that accurately simulates the intrinsic properties of optimization passes and an agent trained on this model to produce effective optimization strategies.
It excels across diverse datasets, surpassing LLVM's built-in optimizations and other state-of-the-art methods in both settings of value prediction and end-to-end code optimization.
arXiv Detail & Related papers (2024-04-24T09:20:33Z) - LLM4Decompile: Decompiling Binary Code with Large Language Models [10.346311290153398]
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results difficult to read and execute.
We propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code.
The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate.
arXiv Detail & Related papers (2024-03-08T13:10:59Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Large Language Models for Compiler Optimization [22.52765975286403]
We present a transformer model trained from scratch to optimize LLVM assembly for code size.
We ask the model to predict the instruction counts before and after optimization, and the optimized code itself.
Our approach achieves a 3.0% improvement in reducing instruction counts over the compiler.
arXiv Detail & Related papers (2023-09-11T22:11:46Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z) - Extending Source Code Pre-Trained Language Models to Summarise
Decompiled Binaries [4.0484792045035505]
We extend large pre-trained language models of source code to summarise decompiled binary functions.
We investigate the impact of input and data properties on the performance of such models.
BinT5 achieves the state-of-the-art BLEU-4 score of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code.
arXiv Detail & Related papers (2023-01-04T16:56:33Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - Static Neural Compiler Optimization via Deep Reinforcement Learning [1.458855293397494]
In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem.
Provided with sub-sequences constituting LLVM's O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training.
We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents.
arXiv Detail & Related papers (2020-08-20T13:16:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.