Reusing Legacy Code in WebAssembly: Key Challenges of Cross-Compilation and Code Semantics Preservation
- URL: http://arxiv.org/abs/2412.20258v1
- Date: Sat, 28 Dec 2024 20:24:41 GMT
- Title: Reusing Legacy Code in WebAssembly: Key Challenges of Cross-Compilation and Code Semantics Preservation
- Authors: Sara Baradaran, Liyan Huang, Mukund Raghothaman, Weihang Wang,
- Abstract summary: We investigate how well WebAssembly compilers fulfill code reusability.
We identify the key challenges in cross-compiling legacy C/C++ code into WebAssembly.
Using WasmChecker, we provide a witness that WebAssembly compilers do not necessarily preserve code semantics.
- Score: 4.796846173598521
- License:
- Abstract: WebAssembly (Wasm) has emerged as a powerful technology for executing high-performance code and reusing legacy code in web browsers. With its increasing adoption, ensuring the reliability of WebAssembly code becomes paramount. In this paper, we investigate how well WebAssembly compilers fulfill code reusability. Specifically, we inquire (1) what challenges arise when cross-compiling a high-level language codebase into WebAssembly and (2) how faithfully WebAssembly compilers preserve code semantics in this new binary. Through a study on 115 open-source codebases, we identify the key challenges in cross-compiling legacy C/C++ code into WebAssembly, highlighting the risks of silent miscompilation and compile-time errors. We categorize these challenges based on their root causes and propose corresponding solutions. We then introduce a differential testing approach, implemented in a framework named WasmChecker, to investigate the semantics equivalency of code between native x86-64 and WebAssembly binaries. Using WasmChecker, we provide a witness that WebAssembly compilers do not necessarily preserve code semantics when cross-compiling high-level language code into WebAssembly due to different implementations of standard libraries, unsupported system calls/APIs, WebAssembly's unique features, and compiler bugs. Furthermore, we have identified 11 new bugs in the Emscripten compiler toolchain, all confirmed by Emscripten developers. As proof of concept, we make our framework and the collected dataset of open-source codebases publicly available.
Related papers
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z) - RedCode: Risky Code Execution and Generation Benchmark for Code Agents [50.81206098588923]
RedCode is a benchmark for risky code execution and generation.
RedCode-Exec provides challenging prompts that could lead to risky code execution.
RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions.
arXiv Detail & Related papers (2024-11-12T13:30:06Z) - Is This the Same Code? A Comprehensive Study of Decompilation Techniques for WebAssembly Binaries [4.66875056781341]
We present a novel framework for empirically evaluating C-based decompilers from various aspects including correctness/ readability/ and structural similarity.
This in turn contributes to bolstering the security and reliability of software systems that rely on WASM and native binaries.
arXiv Detail & Related papers (2024-11-04T17:08:03Z) - ViC: Virtual Compiler Is All You Need For Assembly Code Search [9.674880905252628]
This paper explores training a Large Language Model (LLM) to emulate a general compiler.
We further pre-train CodeLlama as a Virtual Compiler (ViC), capable of compiling any source code of any language to assembly code.
We achieve a substantial improvement in assembly code search performance, with our model surpassing the leading baseline by 26%.
arXiv Detail & Related papers (2024-08-10T17:23:02Z) - Building Call Graph of WebAssembly Programs via Abstract Semantics [0.24103772239130034]
WebAssembly is a binary format for code that is gaining popularity thanks to its focus on portability and performance.
The binary format of WebAssembly makes it prone to being used as a vehicle for malicious software.
There is substantial interest in developing tools for WebAssembly security verification, information flow control, and, more generally, for verifying behavioral properties.
arXiv Detail & Related papers (2024-07-08T09:32:47Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM)
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation [2.1094456929188676]
StackSight visualizes and tracks virtual stack alterations via a static analysis algorithm and then applies chain-of-thought prompting.
Evaluation results show that StackSight significantly improves WebAssembly decompilation.
Our user study also demonstrates that code snippets generated by StackSight have significantly higher win rates and enable a better grasp of code semantics.
arXiv Detail & Related papers (2024-06-07T01:08:17Z) - How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z) - SoK: Analysis techniques for WebAssembly [0.0]
WebAssembly is a low-level bytecode language that allows languages like C, C++, and Rust to be executed in the browser at near-native performance.
Vulnerabilities in memory-unsafe languages, like C and C++, can translate into vulnerabilities in WebAssembly binaries.
WebAssembly has been used for malicious purposes like cryptojacking.
arXiv Detail & Related papers (2024-01-11T14:28:13Z) - InterCode: Standardizing and Benchmarking Interactive Coding with
Execution Feedback [50.725076393314964]
We introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning environment.
Our framework is language and platform agnostic, uses self-contained Docker environments to provide safe and reproducible execution.
We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies.
arXiv Detail & Related papers (2023-06-26T17:59:50Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.