WaDec: Decompiling WebAssembly Using Large Language Model
- URL: http://arxiv.org/abs/2406.11346v3
- Date: Wed, 11 Sep 2024 10:05:37 GMT
- Title: WaDec: Decompiling WebAssembly Using Large Language Model
- Authors: Xinyu She, Yanjie Zhao, Haoyu Wang
- Abstract summary: WebAssembly (abbreviated Wasm) has emerged as a cornerstone of web development.
Despite its advantages, Wasm's binary nature presents significant challenges for developers and researchers.
We introduce a novel approach, WaDec, which is the first use of a fine-tuned LLM to interpret and decompile Wasm binary code.
- Score: 5.667013605202579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: WebAssembly (abbreviated Wasm) has emerged as a cornerstone of web development, offering a compact binary format that allows high-performance applications to run at near-native speeds in web browsers. Despite its advantages, Wasm's binary nature presents significant challenges for developers and researchers, particularly regarding readability when debugging or analyzing web applications. Therefore, effective decompilation becomes crucial. Unfortunately, traditional decompilers often struggle with producing readable outputs. While some large language model (LLM)-based decompilers have shown good compatibility with general binary files, they still face specific challenges when dealing with Wasm. In this paper, we introduce a novel approach, WaDec, which is the first use of a fine-tuned LLM to interpret and decompile Wasm binary code into a higher-level, more comprehensible source code representation. The LLM was meticulously fine-tuned using a specialized dataset of wat-c code snippets, employing self-supervised learning techniques. This enables WaDec to effectively decompile not only complete wat functions but also finer-grained wat code snippets. Our experiments demonstrate that WaDec markedly outperforms current state-of-the-art tools, offering substantial improvements across several metrics. It achieves a code inflation rate of only 3.34%, a dramatic 97% reduction compared to the state-of-the-art's 116.94%. Unlike the baselines, whose outputs cannot be directly compiled or executed, WaDec maintains a recompilability rate of 52.11%, a re-execution rate of 43.55%, and an output consistency of 27.15%. Additionally, it significantly exceeds state-of-the-art performance in AST edit distance similarity by 185%, cyclomatic complexity by 8%, and cosine similarity by 41%, achieving an average code similarity above 50%.
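The abstract describes a two-stage flow: disassemble the Wasm binary into the textual wat format, then have the fine-tuned LLM translate wat snippets into C. A minimal sketch of that pipeline, assuming wabt's wasm2wat tool for the disassembly step and with query_finetuned_llm as a hypothetical placeholder for however the fine-tuned model is served:

```python
import subprocess
from pathlib import Path

def wasm_to_wat(wasm_path: str) -> str:
    """Disassemble a Wasm binary into its textual (wat) form via wabt's wasm2wat."""
    wat_path = Path(wasm_path).with_suffix(".wat")
    subprocess.run(["wasm2wat", wasm_path, "-o", str(wat_path)], check=True)
    return wat_path.read_text()

def query_finetuned_llm(wat_snippet: str) -> str:
    """Placeholder: send a wat snippet to the fine-tuned decompilation model."""
    raise NotImplementedError("serve the fine-tuned model behind this call")

def decompile(wasm_path: str) -> str:
    wat = wasm_to_wat(wasm_path)
    # Splitting on top-level functions is one plausible (assumed) way to
    # recover the snippet granularity the paper fine-tunes on.
    snippets = wat.split("(func ")
    return "\n".join(query_finetuned_llm("(func " + s) for s in snippets[1:])
```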
Related papers
- Is This the Same Code? A Comprehensive Study of Decompilation Techniques for WebAssembly Binaries [4.66875056781341]
We present a novel framework for empirically evaluating C-based decompilers from various aspects, including correctness, readability, and structural similarity.
This in turn contributes to bolstering the security and reliability of software systems that rely on WASM and native binaries.
arXiv Detail & Related papers (2024-11-04T17:08:03Z)
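As a toy illustration of the structural-similarity axis the framework above measures (not the paper's actual metric), one crude proxy is to collapse identifiers and literals in two C sources and compare the remaining token shapes:

```python
import difflib
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

def structure_tokens(src: str) -> list[str]:
    out = []
    for tok in TOKEN.findall(src):
        if tok in KEYWORDS or not (tok[0].isalnum() or tok[0] == "_"):
            out.append(tok)      # keep keywords and punctuation as-is
        else:
            out.append("ID")     # collapse identifiers and literals
    return out

def structural_similarity(original: str, decompiled: str) -> float:
    a, b = structure_tokens(original), structure_tokens(decompiled)
    return difflib.SequenceMatcher(None, a, b).ratio()  # 0.0 .. 1.0
```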
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- Demystifying and Assessing Code Understandability in Java Decompilation [3.2671789531342457]
Decompilation, the process of converting machine-level code into readable source code, plays a critical role in reverse engineering.
We propose the first empirical study on the understandability of Java decompiled code.
arXiv Detail & Related papers (2024-09-30T14:44:00Z)
- Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask [74.64216073678617]
AMD performs parallel NAR inference within contiguous blocks of output labels concealed using attention masks.
A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities.
Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x.
arXiv Detail & Related papers (2024-06-14T13:42:38Z)
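For intuition about the block-based attention mask the entry above describes, here is one assumed construction in NumPy: positions see every completed block plus themselves, while positions inside the open block stay hidden from one another so they can be decoded in parallel. The paper's exact masking scheme may differ:

```python
import numpy as np

def block_attention_mask(seq_len: int, block: int) -> np.ndarray:
    """mask[i, j] == True where position i may attend to position j."""
    idx = np.arange(seq_len)
    block_id = idx // block
    # Attend to any position in a strictly earlier (completed) block...
    mask = block_id[:, None] > block_id[None, :]
    # ...plus yourself, so each query still sees its own embedding.
    np.fill_diagonal(mask, True)
    return mask

print(block_attention_mask(6, 3).astype(int))
```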
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
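The detector above scores code by how much it survives an LLM rewrite, since synthetic code tends to change less under rewriting than human code. A hedged sketch, with rewrite_with_llm as a hypothetical model call and difflib as a stand-in for the paper's similarity measure:

```python
import difflib

def rewrite_with_llm(code: str) -> str:
    """Placeholder: prompt a code LLM to rewrite the snippet."""
    raise NotImplementedError("call your code LLM with a 'rewrite this' prompt")

def synthetic_score(code: str, n_rewrites: int = 4) -> float:
    # Average similarity between the input and several independent rewrites;
    # a higher score suggests the code is LLM-generated.
    sims = [difflib.SequenceMatcher(None, code, rewrite_with_llm(code)).ratio()
            for _ in range(n_rewrites)]
    return sum(sims) / len(sims)
```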
- Theoretically Achieving Continuous Representation of Oriented Bounding Boxes [64.15627958879053]
This paper endeavors to completely solve the issue of discontinuity in Oriented Bounding Box representation.
We propose a novel representation method called Continuous OBB (COBB) which can be readily integrated into existing detectors.
For fairness and transparency of experiments, we have developed a modularized benchmark for oriented object detection (OOD) evaluation based on JDet, the detection toolbox of the open-source deep learning framework Jittor.
arXiv Detail & Related papers (2024-02-29T09:27:40Z)
- ReGAL: Refactoring Programs to Discover Generalizable Abstractions [59.05769810380928]
Generalizable Abstraction Learning (ReGAL) is a method for learning a library of reusable functions via code refactorization.
We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains.
For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains.
arXiv Detail & Related papers (2024-01-29T18:45:30Z)
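A toy example (not ReGAL itself) of the kind of reusable abstraction such library learning surfaces in LOGO-style programs, one of the domains evaluated above:

```python
def draw_polygon(turtle, sides: int, length: float) -> None:
    """Shared abstraction: any regular polygon is a turn-and-step loop."""
    for _ in range(sides):
        turtle.forward(length)
        turtle.left(360 / sides)

def draw_square(turtle):    # was: four copy-pasted forward/left pairs
    draw_polygon(turtle, 4, 50)

def draw_triangle(turtle):  # was: three copy-pasted forward/left pairs
    draw_polygon(turtle, 3, 50)

# Any object exposing forward()/left() works, e.g. turtle.Turtle() from the
# standard library: draw_square(turtle.Turtle())
```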
- Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries [4.0484792045035505]
We extend large pre-trained language models of source code to summarise decompiled binary functions.
We investigate the impact of input and data properties on the performance of such models.
BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively.
arXiv Detail & Related papers (2023-01-04T16:56:33Z)
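For reference, BLEU-4 figures like those reported above can be computed per summary with NLTK; the whitespace tokenization below is a simplifying assumption, and the paper scores whole corpora rather than single sentences:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the length of the given string".split()
candidate = "return length of given string".split()

# Equal weights over 1- to 4-grams give the standard BLEU-4; smoothing avoids
# zero scores on short sentences with missing n-gram orders.
score = sentence_bleu([reference], candidate, weights=(0.25,) * 4,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {100 * score:.2f}")
```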
- Boosting Neural Networks to Decompile Optimized Binaries [13.255618541522436]
Decompilation aims to transform a low-level programming language (LPL) into its functionally equivalent high-level programming language (HPL).
We propose a novel learning-based approach named NeurDP, which targets compiler-optimized binaries.
arXiv Detail & Related papers (2023-01-03T06:45:54Z)
- Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
arXiv Detail & Related papers (2021-01-19T11:45:46Z)
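A minimal sketch of the supervised setup above, with random vectors standing in for the engineered per-function features the paper extracts from binaries; the reported 79.1% F1 comes from far richer features and tuning than this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))       # stand-in for per-function features
y = rng.integers(0, 5, size=1000)     # stand-in return-type labels (int, ptr, ...)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1 (macro):", f1_score(y_te, clf.predict(X_te), average="macro"))
```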
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
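As a rough illustration (not PolyDL's output) of what polyhedral data-reuse analysis buys, tiling a matrix multiply keeps operand blocks resident in cache; real generated primitives would be vectorized native code rather than Python loops:

```python
import numpy as np

def matmul_tiled(A: np.ndarray, B: np.ndarray, T: int = 32) -> np.ndarray:
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for k0 in range(0, k, T):
                # Each operand block is reused across the whole tile instead
                # of being re-fetched row by row.
                C[i0:i0+T, j0:j0+T] += A[i0:i0+T, k0:k0+T] @ B[k0:k0+T, j0:j0+T]
    return C
```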