StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation
- URL: http://arxiv.org/abs/2406.04568v1
- Date: Fri, 7 Jun 2024 01:08:17 GMT
- Title: StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation
- Authors: Weike Fang, Zhejian Zhou, Junzhou He, Weihang Wang
- Abstract summary: StackSight visualizes and tracks virtual stack alterations via a static analysis algorithm and then applies chain-of-thought prompting.
Evaluation results show that StackSight significantly improves WebAssembly decompilation.
Our user study also demonstrates that code snippets generated by StackSight have significantly higher win rates and enable a better grasp of code semantics.
- Score: 2.1094456929188676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: WebAssembly enables near-native execution in web applications and is increasingly adopted for tasks that demand high performance and robust security. However, its assembly-like syntax, implicit stack machine, and low-level data types make it extremely difficult for human developers to understand, spurring the need for effective WebAssembly reverse engineering techniques. In this paper, we propose StackSight, a novel neurosymbolic approach that combines Large Language Models (LLMs) with advanced program analysis to decompile complex WebAssembly code into readable C++ snippets. StackSight visualizes and tracks virtual stack alterations via a static analysis algorithm and then applies chain-of-thought prompting to harness LLM's complex reasoning capabilities. Evaluation results show that StackSight significantly improves WebAssembly decompilation. Our user study also demonstrates that code snippets generated by StackSight have significantly higher win rates and enable a better grasp of code semantics.
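As a rough illustration of the stack-tracking step described in the abstract, the following is a minimal Python sketch (instruction coverage and naming are illustrative, not StackSight's actual algorithm) that symbolically executes a linear WebAssembly instruction sequence and records the abstract stack state after each instruction:

```python
# Hypothetical sketch of visualizing virtual stack alterations via
# static analysis -- names and opcode coverage are ours, not StackSight's.

def track_stack(instructions):
    """Symbolically execute a linear Wasm instruction sequence,
    recording the abstract stack state after each instruction."""
    stack, trace = [], []
    for op, *args in instructions:
        if op == "i32.const":
            stack.append(str(args[0]))
        elif op == "local.get":
            stack.append(f"local{args[0]}")
        elif op == "i32.add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(f"({lhs} + {rhs})")
        elif op == "i32.mul":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(f"({lhs} * {rhs})")
        elif op == "local.set":
            trace.append(f"local{args[0]} = {stack.pop()}")
        trace.append(f"after {op}: {stack}")
    return stack, trace

# Example: compute local0 * 2 + local1
code = [("local.get", 0), ("i32.const", 2), ("i32.mul",),
        ("local.get", 1), ("i32.add",)]
final, trace = track_stack(code)
print(final)  # ['((local0 * 2) + local1)']
```

A trace like this, serialized into the prompt, is the kind of intermediate stack visualization that chain-of-thought prompting can then reason over.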
Related papers
- Is This the Same Code? A Comprehensive Study of Decompilation Techniques for WebAssembly Binaries [4.66875056781341]
We present a novel framework for empirically evaluating C-based decompilers across various aspects, including correctness, readability, and structural similarity.
This in turn contributes to bolstering the security and reliability of software systems that rely on WASM and native binaries.
arXiv Detail & Related papers (2024-11-04T17:08:03Z)
- Building Call Graph of WebAssembly Programs via Abstract Semantics [0.24103772239130034]
WebAssembly is a binary format for code that is gaining popularity thanks to its focus on portability and performance.
The binary format of WebAssembly makes it prone to being used as a vehicle for malicious software.
There is substantial interest in developing tools for WebAssembly security verification, information flow control, and, more generally, for verifying behavioral properties.
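The abstract does not spell out the construction; as a hedged sketch, a direct-call graph over a flat listing of Wasm functions could look like the following (function names and the `<indirect>` placeholder are illustrative, and the abstract-semantics treatment of indirect calls that the paper addresses is precisely what this sketch omits):

```python
# Illustrative sketch only: direct-call graph from flat Wasm function
# bodies. Indirect calls are marked rather than resolved.

def build_call_graph(functions):
    """functions: dict mapping function name -> list of (opcode, operand)."""
    graph = {name: set() for name in functions}
    for name, body in functions.items():
        for op, operand in body:
            if op == "call":                 # direct call: target is static
                graph[name].add(operand)
            elif op == "call_indirect":      # target unknown without analysis
                graph[name].add("<indirect>")
    return graph

module = {
    "main":  [("call", "parse"), ("call_indirect", None)],
    "parse": [("call", "lex")],
    "lex":   [],
}
graph = build_call_graph(module)
for fn in sorted(graph):
    print(fn, sorted(graph[fn]))
# lex []
# main ['<indirect>', 'parse']
# parse ['lex']
```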
arXiv Detail & Related papers (2024-07-08T09:32:47Z)
- Multi-modal Learning for WebAssembly Reverse Engineering [7.18491643197374]
We present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering.
WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus.
We fine-tune WasmRev on three important reverse engineering tasks: type recovery, function purpose identification, and WebAssembly summarization.
arXiv Detail & Related papers (2024-04-04T03:03:38Z)
- Code-Switched Language Identification is Harder Than You Think [69.63439391717691]
Code switching is a common phenomenon in written and spoken communication.
We examine its application to building code-switched (CS) corpora.
We make the task more realistic by scaling it to more languages.
We reformulate the task as a sentence-level multi-label tagging problem to make it more tractable.
arXiv Detail & Related papers (2024-02-02T15:38:47Z)
- SoK: Analysis techniques for WebAssembly [0.0]
WebAssembly is a low-level bytecode language that allows languages like C, C++, and Rust to be executed in the browser at near-native performance.
Vulnerabilities in memory-unsafe languages, like C and C++, can translate into vulnerabilities in WebAssembly binaries.
WebAssembly has been used for malicious purposes like cryptojacking.
arXiv Detail & Related papers (2024-01-11T14:28:13Z)
- LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z)
- When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose the complexity-impacted reasoning score (CIRS) to measure the correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
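The paper's actual CIRS formula is not reproduced here; purely as a loose illustration of the AST-based idea, one can count reasoning-relevant structural features of a snippet with Python's `ast` module:

```python
import ast

# Rough illustration (not the paper's CIRS metric): count structural
# features of a snippet that plausibly bear on logical complexity.

def logical_complexity(source):
    tree = ast.parse(source)
    branches = sum(isinstance(n, (ast.If, ast.IfExp)) for n in ast.walk(tree))
    loops = sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))
    calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))
    # size of the largest top-level statement, as a crude depth proxy
    max_stmt_size = max((len(list(ast.walk(n)))
                         for n in ast.iter_child_nodes(tree)), default=0)
    return {"branches": branches, "loops": loops,
            "calls": calls, "max_stmt_size": max_stmt_size}

snippet = "for i in range(3):\n    if i % 2:\n        print(i)"
result = logical_complexity(snippet)
print(result)
```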
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
- CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors [92.17328076003628]
Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning ability on many NLP tasks.
In this paper, we propose to recast the structured output in the form of code instead of natural language.
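To illustrate the recast-as-code idea (the prompt and completion below are our own example, not taken from the CodeIE paper): the few-shot prompt asks a code-pretrained model to complete a Python structure, so the extraction output is directly parseable instead of free-form text:

```python
import ast

# Hypothetical prompt: the model is asked to continue a Python literal
# rather than produce natural-language output.
prompt = '''
def named_entity_recognition():
    """Extract (entity span, entity type) pairs from the input text."""
    text = "Steve Jobs founded Apple in California."
    entities = ['''

# A plausible model completion of the open list literal:
completion = '''("Steve Jobs", "person"), ("Apple", "organization"),
        ("California", "location")]'''

# Because the output is code, parsing it back is trivial and robust.
entities = ast.literal_eval("[" + completion)
print(entities[0])  # ('Steve Jobs', 'person')
```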
arXiv Detail & Related papers (2023-05-09T18:40:31Z)
- Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
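The paper's exact objective is not given in this summary; an in-batch InfoNCE-style contrastive loss over paired (query, code) embeddings is a standard building block for this setting, sketched minimally here in NumPy:

```python
import numpy as np

# Generic in-batch contrastive loss for code search: each query's
# positive is the code at the same index; other codes are negatives.
# A minimal sketch, not the paper's exact objective.

def info_nce(query_emb, code_emb, temperature=0.07):
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = q @ c.T / temperature                # (batch, batch) cosine sims
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

rng = np.random.default_rng(0)
q, c = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = info_nce(q, c)
print(float(loss))
```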
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.