StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation
- URL: http://arxiv.org/abs/2406.04568v1
- Date: Fri, 7 Jun 2024 01:08:17 GMT
- Title: StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation
- Authors: Weike Fang, Zhejian Zhou, Junzhou He, Weihang Wang
- Abstract summary: StackSight visualizes and tracks virtual stack alterations via a static analysis algorithm and then applies chain-of-thought prompting.
Evaluation results show that StackSight significantly improves WebAssembly decompilation.
Our user study also demonstrates that code snippets generated by StackSight have significantly higher win rates and enable a better grasp of code semantics.
- Score: 2.1094456929188676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: WebAssembly enables near-native execution in web applications and is increasingly adopted for tasks that demand high performance and robust security. However, its assembly-like syntax, implicit stack machine, and low-level data types make it extremely difficult for human developers to understand, spurring the need for effective WebAssembly reverse engineering techniques. In this paper, we propose StackSight, a novel neurosymbolic approach that combines Large Language Models (LLMs) with advanced program analysis to decompile complex WebAssembly code into readable C++ snippets. StackSight visualizes and tracks virtual stack alterations via a static analysis algorithm and then applies chain-of-thought prompting to harness LLM's complex reasoning capabilities. Evaluation results show that StackSight significantly improves WebAssembly decompilation. Our user study also demonstrates that code snippets generated by StackSight have significantly higher win rates and enable a better grasp of code semantics.
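As a rough illustration of the stack-tracking step described in the abstract, the following is a minimal Python sketch (instruction coverage and naming are illustrative, not StackSight's actual algorithm) that symbolically executes a linear WebAssembly instruction sequence and records the abstract stack state after each instruction:

```python
# Hypothetical sketch of visualizing virtual stack alterations via
# static analysis -- names and opcode coverage are ours, not StackSight's.

def track_stack(instructions):
    """Symbolically execute a linear Wasm instruction sequence,
    recording the abstract stack state after each instruction."""
    stack, trace = [], []
    for op, *args in instructions:
        if op == "i32.const":
            stack.append(str(args[0]))
        elif op == "local.get":
            stack.append(f"local{args[0]}")
        elif op == "i32.add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(f"({lhs} + {rhs})")
        elif op == "i32.mul":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(f"({lhs} * {rhs})")
        elif op == "local.set":
            trace.append(f"local{args[0]} = {stack.pop()}")
        trace.append(f"after {op}: {stack}")
    return stack, trace

# Example: compute local0 * 2 + local1
code = [("local.get", 0), ("i32.const", 2), ("i32.mul",),
        ("local.get", 1), ("i32.add",)]
final, trace = track_stack(code)
print(final)  # ['((local0 * 2) + local1)']
```

A trace like this, serialized into the prompt, is the kind of intermediate stack visualization that chain-of-thought prompting can then reason over.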
Related papers
- Is This the Same Code? A Comprehensive Study of Decompilation Techniques for WebAssembly Binaries [4.66875056781341]
We present a novel framework for empirically evaluating C-based decompilers across various aspects, including correctness, readability, and structural similarity.
This in turn contributes to bolstering the security and reliability of software systems that rely on WASM and native binaries.
arXiv Detail & Related papers (2024-11-04T17:08:03Z)
- Building Call Graph of WebAssembly Programs via Abstract Semantics [0.24103772239130034]
WebAssembly is a binary format for code that is gaining popularity thanks to its focus on portability and performance.
The binary format of WebAssembly makes it prone to being used as a vehicle for malicious software.
There is substantial interest in developing tools for WebAssembly security verification, information flow control, and, more generally, for verifying behavioral properties.
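The abstract does not spell out the construction; as a hedged sketch, a direct-call graph over a flat listing of Wasm functions could look like the following (function names and the `<indirect>` placeholder are illustrative, and the abstract-semantics treatment of indirect calls that the paper addresses is precisely what this sketch omits):

```python
# Illustrative sketch only: direct-call graph from flat Wasm function
# bodies. Indirect calls are marked rather than resolved.

def build_call_graph(functions):
    """functions: dict mapping function name -> list of (opcode, operand)."""
    graph = {name: set() for name in functions}
    for name, body in functions.items():
        for op, operand in body:
            if op == "call":                 # direct call: target is static
                graph[name].add(operand)
            elif op == "call_indirect":      # target unknown without analysis
                graph[name].add("<indirect>")
    return graph

module = {
    "main":  [("call", "parse"), ("call_indirect", None)],
    "parse": [("call", "lex")],
    "lex":   [],
}
graph = build_call_graph(module)
for fn in sorted(graph):
    print(fn, sorted(graph[fn]))
# lex []
# main ['<indirect>', 'parse']
# parse ['lex']
```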
arXiv Detail & Related papers (2024-07-08T09:32:47Z)
- Multi-modal Learning for WebAssembly Reverse Engineering [7.18491643197374]
We present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering.
WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus.
We fine-tune WasmRev on three important reverse engineering tasks: type recovery, function purpose identification, and WebAssembly summarization.
arXiv Detail & Related papers (2024-04-04T03:03:38Z)
- Code-Switched Language Identification is Harder Than You Think [69.63439391717691]
Code switching is a common phenomenon in written and spoken communication.
We examine its application to building code-switched (CS) corpora.
We make the task more realistic by scaling it to more languages.
We reformulate the task as a sentence-level multi-label tagging problem to make it more tractable.
arXiv Detail & Related papers (2024-02-02T15:38:47Z)
- SoK: Analysis techniques for WebAssembly [0.0]
WebAssembly is a low-level bytecode language that allows languages like C, C++, and Rust to be executed in the browser at near-native performance.
Vulnerabilities in memory-unsafe languages, like C and C++, can translate into vulnerabilities in WebAssembly binaries.
WebAssembly has been used for malicious purposes like cryptojacking.
arXiv Detail & Related papers (2024-01-11T14:28:13Z)
- LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z)
- When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose the complexity-impacted reasoning score (CIRS) to measure the correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
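The paper's actual CIRS formula is not reproduced here; purely as a loose illustration of the AST-based idea, one can count reasoning-relevant structural features of a snippet with Python's `ast` module:

```python
import ast

# Rough illustration (not the paper's CIRS metric): count structural
# features of a snippet that plausibly bear on logical complexity.

def logical_complexity(source):
    tree = ast.parse(source)
    branches = sum(isinstance(n, (ast.If, ast.IfExp)) for n in ast.walk(tree))
    loops = sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))
    calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))
    # size of the largest top-level statement, as a crude depth proxy
    max_stmt_size = max((len(list(ast.walk(n)))
                         for n in ast.iter_child_nodes(tree)), default=0)
    return {"branches": branches, "loops": loops,
            "calls": calls, "max_stmt_size": max_stmt_size}

snippet = "for i in range(3):\n    if i % 2:\n        print(i)"
result = logical_complexity(snippet)
print(result)
```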
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
- CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors [92.17328076003628]
Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning ability on many NLP tasks.
In this paper, we propose to recast the structured output in the form of code instead of natural language.
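To illustrate the recast-as-code idea (the prompt and completion below are our own example, not taken from the CodeIE paper): the few-shot prompt asks a code-pretrained model to complete a Python structure, so the extraction output is directly parseable instead of free-form text:

```python
import ast

# Hypothetical prompt: the model is asked to continue a Python literal
# rather than produce natural-language output.
prompt = '''
def named_entity_recognition():
    """Extract (entity span, entity type) pairs from the input text."""
    text = "Steve Jobs founded Apple in California."
    entities = ['''

# A plausible model completion of the open list literal:
completion = '''("Steve Jobs", "person"), ("Apple", "organization"),
        ("California", "location")]'''

# Because the output is code, parsing it back is trivial and robust.
entities = ast.literal_eval("[" + completion)
print(entities[0])  # ('Steve Jobs', 'person')
```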
arXiv Detail & Related papers (2023-05-09T18:40:31Z)
- Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
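The paper's exact objective is not given in this summary; an in-batch InfoNCE-style contrastive loss over paired (query, code) embeddings is a standard building block for this setting, sketched minimally here in NumPy:

```python
import numpy as np

# Generic in-batch contrastive loss for code search: each query's
# positive is the code at the same index; other codes are negatives.
# A minimal sketch, not the paper's exact objective.

def info_nce(query_emb, code_emb, temperature=0.07):
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = q @ c.T / temperature                # (batch, batch) cosine sims
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

rng = np.random.default_rng(0)
q, c = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = info_nce(q, c)
print(float(loss))
```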
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.