Related papers: WRTester: Differential Testing of WebAssembly Runtimes via Semantic-aware Binary Generation

WRTester: Differential Testing of WebAssembly Runtimes via Semantic-aware Binary Generation

URL: http://arxiv.org/abs/2312.10456v1
Date: Sat, 16 Dec 2023 14:02:42 GMT
Title: WRTester: Differential Testing of WebAssembly Runtimes via Semantic-aware Binary Generation
Authors: Shangtong Cao, Ningyu He, Xinyu She, Yixuan Zhang, Mu Zhang, Haoyu Wang
Abstract summary: We present WRTester, a novel differential testing framework that can generated complicated Wasm test cases by disassembling and assembling real-world Wasm binaries. For further pinpointing the root causes of unexpected behaviors, we design a runtime-agnostic root cause location method to accurately locate bugs. We have uncovered 33 unique bugs in popular Wasm runtimes, among which 25 have been confirmed.
Score: 19.78427170624683
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Wasm runtime is a fundamental component in the Wasm ecosystem, as it directly impacts whether Wasm applications can be executed as expected. Bugs in Wasm runtime bugs are frequently reported, thus our research community has made a few attempts to design automated testing frameworks for detecting bugs in Wasm runtimes. However, existing testing frameworks are limited by the quality of test cases, i.e., they face challenges of generating both semantic-rich and syntactic-correct Wasm binaries, thus complicated bugs cannot be triggered. In this work, we present WRTester, a novel differential testing framework that can generated complicated Wasm test cases by disassembling and assembling of real-world Wasm binaries, which can trigger hidden inconsistencies among Wasm runtimes. For further pinpointing the root causes of unexpected behaviors, we design a runtime-agnostic root cause location method to accurately locate bugs. Extensive evaluation suggests that WRTester outperforms SOTA techniques in terms of both efficiency and effectiveness. We have uncovered 33 unique bugs in popular Wasm runtimes, among which 25 have been confirmed.

Related papers

Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z)
Distinguishability-guided Test Program Generation for WebAssembly Runtime Performance Testing [28.920256869194315]
High performance is a critical design goal of WebAssembly (Wasm) Research on Wasm runtime performance testing still suffers from insufficient high-quality test programs. In particular, WarpGen has identified seven new performance issues in three Wasm runtimes.
arXiv Detail & Related papers (2024-12-28T09:51:23Z)
Leveraging Stack Traces for Spectrum-based Fault Localization in the Absence of Failing Tests [44.13331329339185]
We introduce a new approach, SBEST, that integrates stack trace data with test coverage to enhance fault localization. Our approach shows a significant improvement, increasing Mean Average Precision (MAP) by 32.22% and Mean Reciprocal Rank (MRR) by 17.43% over traditional stack trace ranking methods.
arXiv Detail & Related papers (2024-05-01T15:15:52Z)
Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution. TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults. Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs) It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions [8.508198765617196]
We present GitBug-Actions, a novel tool for building bug-fix benchmarks with modern and fully-reproducible bug-fixes. GitBug-Actions relies on the most popular CI platform, GitHub Actions, to detect bug-fixes. To demonstrate our toolchain, we deploy GitBug-Actions to build a proof-of-concept Go bug-fix benchmark.
arXiv Detail & Related papers (2023-10-24T09:04:14Z)
Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that either can be qualified as simple (e.g. unit tests) or that require precise specifications. Most testing procedures still rely on test cases written by humans to form test suites. We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
Revealing Performance Issues in Server-side WebAssembly Runtimes via Differential Testing [28.187405253760687]
We design a novel differential testing approach WarpDiff to identify performance issues in server-side Wasm runtimes. We identify abnormal cases where the execution time ratio significantly deviates from the oracle ratio and locate the Wasm runtimes that cause the performance issues.
arXiv Detail & Related papers (2023-09-21T15:25:18Z)
PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection [8.79879909193717]
We introduce PreciseBugCollector, a precise, multi-language bug collection approach. It is based on two novel components: a bug tracker to map the repositories with external bug repositories to trace bug type information, and a bug injector to generate project-specific bugs. To date, PreciseBugCollector comprises 1057818 bugs extracted from 2968 open-source projects.
arXiv Detail & Related papers (2023-09-12T13:47:44Z)
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) RAP-Gen explicitly leveraging relevant fix patterns retrieved from a list of previous bug-fix pairs. We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers. We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization. We provide a general benchmark with a diversity of real and synthetic Java bugs. We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
On Distribution Shift in Learning-based Bug Detectors [4.511923587827301]
We train a bug detector in two phases, first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution. We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution.
arXiv Detail & Related papers (2022-04-21T12:17:22Z)
Detecting Rewards Deterioration in Episodic Reinforcement Learning [63.49923393311052]
In many RL applications, once training ends, it is vital to detect any deterioration in the agent performance as soon as possible. We consider an episodic framework, where the rewards within each episode are not independent, nor identically-distributed, nor Markov. We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power.
arXiv Detail & Related papers (2020-10-22T12:45:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.