Distinguishability-guided Test Program Generation for WebAssembly Runtime Performance Testing
- URL: http://arxiv.org/abs/2412.20100v1
- Date: Sat, 28 Dec 2024 09:51:23 GMT
- Title: Distinguishability-guided Test Program Generation for WebAssembly Runtime Performance Testing
- Authors: Shuyao Jiang, Ruiying Zeng, Yangfan Zhou, Michael R. Lyu
- Abstract summary: High performance is a critical design goal of WebAssembly (Wasm).
Research on Wasm runtime performance testing still suffers from a lack of high-quality test programs.
The paper proposes WarpGen, a distinguishability-guided test program generation approach; WarpGen has identified seven new performance issues in three Wasm runtimes.
- Score: 28.920256869194315
- Abstract: WebAssembly (Wasm) is a binary instruction format designed as a portable compilation target, which has been widely used on both the web and server sides in recent years. As high performance is a critical design goal of Wasm, it is essential to conduct performance testing for Wasm runtimes. However, existing research on Wasm runtime performance testing still suffers from insufficient high-quality test programs. To solve this problem, we propose a novel test program generation approach WarpGen. It first extracts code snippets from historical issue-triggering test programs as initial operators, then inserts an operator into a seed program to synthesize a new test program. To verify the quality of generated programs, we propose an indicator called distinguishability, which refers to the ability of a test program to distinguish abnormal performance of specific Wasm runtimes. We apply WarpGen for performance testing on four Wasm runtimes and verify its effectiveness compared with baseline approaches. In particular, WarpGen has identified seven new performance issues in three Wasm runtimes.
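The abstract describes two mechanisms: synthesizing a new test program by inserting an operator (a code snippet extracted from a historical issue-triggering program) into a seed program, and checking distinguishability, i.e., whether the generated program exposes abnormal performance in a specific Wasm runtime. The Python sketch below illustrates one way such a generate-then-check loop could be wired together; the runtime commands, line-based insertion, repeat count, and 2x threshold are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the generate-then-check loop described in the abstract.
# All runtime commands, thresholds, and helper names are assumptions made
# for illustration; they are not taken from the WarpGen paper.
import statistics
import subprocess
import time

# Hypothetical Wasm runtimes under test (commands are placeholders).
RUNTIMES = {
    "wasmtime": ["wasmtime"],
    "wasmer": ["wasmer", "run"],
    "wasmedge": ["wasmedge"],
}

def synthesize(seed_source: str, operator: str, insert_line: int) -> str:
    """Insert an operator (a code snippet taken from a historical
    issue-triggering test program) into a seed program's source text."""
    lines = seed_source.splitlines()
    lines[insert_line:insert_line] = [operator]
    return "\n".join(lines)  # compile this to a .wasm file before measuring

def measure(runtime_cmd: list, wasm_file: str, repeats: int = 5) -> float:
    """Median wall-clock execution time of a compiled Wasm test program."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(runtime_cmd + [wasm_file], check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def is_distinguishing(wasm_file: str, threshold: float = 2.0) -> bool:
    """Treat a test program as 'distinguishing' if some runtime's execution
    time deviates strongly from the median across all runtimes."""
    times = {name: measure(cmd, wasm_file) for name, cmd in RUNTIMES.items()}
    baseline = statistics.median(times.values())
    return any(t / baseline >= threshold for t in times.values())
```

Under this sketch, a generated program that passes the check would be kept as a useful performance test case; one that does not could be discarded or mutated further.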
Related papers
- Revisit Self-Debugging with Self-Generated Tests for Code Generation [18.643472696246686]
Self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential.
We propose two paradigms for the process: post-execution and in-execution self-debugging.
We find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests.
arXiv Detail & Related papers (2025-01-22T10:54:19Z)
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.
We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.
We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z)
- NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs.
We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools [2.0686733932673604]
This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs.
For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code.
Our results show that ChatGPT's performance is comparable with Pynguin in terms of coverage, though for some cases its performance is superior to Pynguin.
arXiv Detail & Related papers (2023-12-17T06:38:11Z)
- WRTester: Differential Testing of WebAssembly Runtimes via Semantic-aware Binary Generation [19.78427170624683]
We present WRTester, a novel differential testing framework that can generate complicated Wasm test cases by disassembling and assembling real-world Wasm binaries.
To further pinpoint the root causes of unexpected behaviors, we design a runtime-agnostic root cause location method to accurately locate bugs.
We have uncovered 33 unique bugs in popular Wasm runtimes, among which 25 have been confirmed.
arXiv Detail & Related papers (2023-12-16T14:02:42Z)
- Revealing Performance Issues in Server-side WebAssembly Runtimes via Differential Testing [28.187405253760687]
We design WarpDiff, a novel differential testing approach to identify performance issues in server-side Wasm runtimes.
We identify abnormal cases where the execution time ratio significantly deviates from the oracle ratio and locate the Wasm runtimes that cause the performance issues; a minimal sketch of such a ratio check appears after this list.
arXiv Detail & Related papers (2023-09-21T15:25:18Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries.
This paper introduces a reinforcement learning framework for carrying out A/B testing in online experiments.
arXiv Detail & Related papers (2020-02-05T10:25:02Z)
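The WarpDiff entry above describes flagging a runtime as abnormal when the execution time ratio across runtimes deviates significantly from an oracle ratio. A minimal, hypothetical illustration of such a check follows; the tolerance value, runtime names, and timing numbers are made-up assumptions, not data from the paper.

```python
# Hypothetical illustration of the oracle-ratio check sketched in the
# WarpDiff summary above; the tolerance and the sample timings are made up.

def abnormal_runtimes(times: dict, oracle_ratio: dict,
                      tolerance: float = 0.5) -> list:
    """Flag runtimes whose execution-time ratio (relative to the fastest
    runtime) deviates from the expected oracle ratio by more than the
    given relative tolerance."""
    fastest = min(times.values())
    flagged = []
    for name, elapsed in times.items():
        observed = elapsed / fastest
        expected = oracle_ratio[name]
        if abs(observed - expected) / expected > tolerance:
            flagged.append(name)
    return flagged

# Example: one runtime takes about five times longer than expected.
print(abnormal_runtimes(
    {"runtime_a": 1.0, "runtime_b": 1.1, "runtime_c": 5.0},
    {"runtime_a": 1.0, "runtime_b": 1.0, "runtime_c": 1.0},
))  # -> ['runtime_c']
```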
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.