WRTester: Differential Testing of WebAssembly Runtimes via
  Semantic-aware Binary Generation
- URL: http://arxiv.org/abs/2312.10456v1
- Date: Sat, 16 Dec 2023 14:02:42 GMT
- Title: WRTester: Differential Testing of WebAssembly Runtimes via
  Semantic-aware Binary Generation
- Authors: Shangtong Cao, Ningyu He, Xinyu She, Yixuan Zhang, Mu Zhang, Haoyu
  Wang
- Abstract summary: We present WRTester, a novel differential testing framework that can generate complicated Wasm test cases by disassembling and assembling real-world Wasm binaries.
To further pinpoint the root causes of unexpected behaviors, we design a runtime-agnostic root cause location method to accurately locate bugs.
We have uncovered 33 unique bugs in popular Wasm runtimes, among which 25 have been confirmed.
- Score: 19.78427170624683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Wasm runtime is a fundamental component in the Wasm ecosystem, as it directly
impacts whether Wasm applications can be executed as expected. Bugs in Wasm
runtimes are frequently reported, so the research community has made a
few attempts to design automated testing frameworks for detecting bugs in Wasm
runtimes. However, existing testing frameworks are limited by the quality of
test cases, i.e., they struggle to generate Wasm binaries that are both
semantically rich and syntactically correct, so complicated bugs cannot be triggered. In
this work, we present WRTester, a novel differential testing framework that can
generate complicated Wasm test cases by disassembling and assembling
real-world Wasm binaries, which can trigger hidden inconsistencies among Wasm
runtimes. To further pinpoint the root causes of unexpected behaviors, we
design a runtime-agnostic root cause location method to accurately locate bugs.
Extensive evaluation suggests that WRTester outperforms SOTA techniques in
terms of both efficiency and effectiveness. We have uncovered 33 unique bugs in
popular Wasm runtimes, among which 25 have been confirmed.
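
The differential testing idea in the abstract can be sketched as follows: run the same Wasm binary on several runtimes, group the runtimes by the outcome they report, and treat any disagreement as a candidate bug. This is only a minimal illustration, not WRTester's implementation. The runner functions, the `(status, value)` outcome shape, and the majority-vote heuristic for flagging suspects are all assumptions made for the example; in practice each runner would wrap a real runtime invocation.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

# Hypothetical outcome shape: (status, value), e.g. ("ok", 42) or ("trap", "unreachable").
Outcome = Tuple[str, Hashable]
# Hypothetical runner: takes the raw Wasm bytes and returns what the runtime observed.
Runner = Callable[[bytes], Outcome]


def differential_test(runners: Dict[str, Runner], wasm: bytes) -> Dict[Outcome, List[str]]:
    """Run one binary on every runtime and group runtime names by outcome."""
    groups: Dict[Outcome, List[str]] = defaultdict(list)
    for name, run in runners.items():
        try:
            outcome = run(wasm)
        except Exception as exc:
            # A crash of the runner itself is recorded as its own outcome.
            outcome = ("crash", type(exc).__name__)
        groups[outcome].append(name)
    return dict(groups)


def suspects(groups: Dict[Outcome, List[str]]) -> List[str]:
    """Assumed heuristic: runtimes outside the largest agreeing group are flagged."""
    if len(groups) <= 1:
        return []  # all runtimes agree, nothing to report
    majority = max(groups.values(), key=len)
    return sorted(n for names in groups.values() if names is not majority for n in names)


# Stand-in runtimes for demonstration only; they do not execute Wasm.
def runtime_a(wasm: bytes) -> Outcome:
    return ("ok", len(wasm))


def runtime_b(wasm: bytes) -> Outcome:
    return ("ok", len(wasm))


def runtime_c(wasm: bytes) -> Outcome:
    # Deliberately diverges for inputs longer than 255 bytes.
    return ("ok", len(wasm) & 0xFF)


if __name__ == "__main__":
    runners = {"runtime_a": runtime_a, "runtime_b": runtime_b, "runtime_c": runtime_c}
    test_case = b"\x00asm" + bytes(300)
    print(suspects(differential_test(runners, test_case)))
```

Majority voting is only one possible oracle. It says which runtime disagrees, not why, which is the gap the paper's root cause location method is meant to address.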

Related papers
- AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests [0.7564784873669823]
 We introduce AssertFlip, a technique for automatically generating Bug Reproducible Tests (BRTs) using large language models (LLMs).
AssertFlip first generates passing tests on the buggy behaviour and then inverts these tests to fail when the bug is present.
Our results show that AssertFlip outperforms all known techniques in the leaderboard of SWT-Bench, a benchmark curated for BRTs.
 arXiv  Detail & Related papers  (2025-07-23T14:19:55Z)
- From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking [48.90371827091671]
 AutoExperiment is a benchmark that evaluates AI agents' ability to implement and run machine learning experiments.
We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases.
Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution.
 arXiv  Detail & Related papers  (2025-06-24T15:39:20Z)
- Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
 We investigate how well current benchmarks quantify model reliability.
Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks.
We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
 arXiv  Detail & Related papers  (2025-02-05T18:58:19Z)
- Distinguishability-guided Test Program Generation for WebAssembly Runtime Performance Testing [28.920256869194315]
 High performance is a critical design goal of WebAssembly (Wasm).
Research on Wasm runtime performance testing still suffers from insufficient high-quality test programs.
In particular, WarpGen has identified seven new performance issues in three Wasm runtimes.
 arXiv  Detail & Related papers  (2024-12-28T09:51:23Z)
- Leveraging Stack Traces for Spectrum-based Fault Localization in the Absence of Failing Tests [44.13331329339185]
 We introduce a new approach, SBEST, that integrates stack trace data with test coverage to enhance fault localization.
Our approach shows a significant improvement, increasing Mean Average Precision (MAP) by 32.22% and Mean Reciprocal Rank (MRR) by 17.43% over traditional stack trace ranking methods.
 arXiv  Detail & Related papers  (2024-05-01T15:15:52Z)
- A Unified Debugging Approach via LLM-Based Multi-Agent Synergy [39.11825182386288]
 FixAgent is an end-to-end framework for unified debugging through multi-agent synergy.
It significantly outperforms state-of-the-art repair methods, fixing 1.25$\times$ to 2.56$\times$ as many bugs on the repo-level benchmark, Defects4J.
 arXiv  Detail & Related papers  (2024-04-26T04:55:35Z)
- Observation-based unit test generation at Meta [52.4716552057909]
 TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
 arXiv  Detail & Related papers  (2024-02-09T00:34:39Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
 DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
 arXiv  Detail & Related papers  (2024-01-09T15:46:38Z)
- GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions [8.508198765617196]
 We present GitBug-Actions, a novel tool for building bug-fix benchmarks with modern and fully-reproducible bug-fixes.
GitBug-Actions relies on the most popular CI platform, GitHub Actions, to detect bug-fixes.
To demonstrate our toolchain, we deploy GitBug-Actions to build a proof-of-concept Go bug-fix benchmark.
 arXiv  Detail & Related papers  (2023-10-24T09:04:14Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
 Existing approaches produce test cases that either can be qualified as simple (e.g. unit tests) or that require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
 arXiv  Detail & Related papers  (2023-10-10T05:30:12Z)
- Revealing Performance Issues in Server-side WebAssembly Runtimes via Differential Testing [28.187405253760687]
 We design a novel differential testing approach WarpDiff to identify performance issues in server-side Wasm runtimes.
We identify abnormal cases where the execution time ratio significantly deviates from the oracle ratio and locate the Wasm runtimes that cause the performance issues.
 arXiv  Detail & Related papers  (2023-09-21T15:25:18Z)
- PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection [8.79879909193717]
 We introduce PreciseBugCollector, a precise, multi-language bug collection approach.
It is based on two novel components: a bug tracker to map the repositories with external bug repositories to trace bug type information, and a bug injector to generate project-specific bugs.
To date, PreciseBugCollector comprises 1057818 bugs extracted from 2968 open-source projects.
 arXiv  Detail & Related papers  (2023-09-12T13:47:44Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
 We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen).
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
 arXiv  Detail & Related papers  (2023-09-12T08:52:56Z)
- Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
 We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
 arXiv  Detail & Related papers  (2022-11-11T16:37:33Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
 BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
 arXiv  Detail & Related papers  (2022-07-21T20:17:53Z)
- On Distribution Shift in Learning-based Bug Detectors [4.511923587827301]
 We train a bug detector in two phases, first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution.
We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution.
 arXiv  Detail & Related papers  (2022-04-21T12:17:22Z)
- Detecting Rewards Deterioration in Episodic Reinforcement Learning [63.49923393311052]
 In many RL applications, once training ends, it is vital to detect any deterioration in the agent performance as soon as possible.
We consider an episodic framework, where the rewards within each episode are not independent, nor identically-distributed, nor Markov.
We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power.
 arXiv  Detail & Related papers  (2020-10-22T12:45:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.