Related papers: WEFix: Intelligent Automatic Generation of Explicit Waits for Efficient Web End-to-End Flaky Tests

WEFix: Intelligent Automatic Generation of Explicit Waits for Efficient Web End-to-End Flaky Tests

URL: http://arxiv.org/abs/2402.09745v1
Date: Thu, 15 Feb 2024 06:51:53 GMT
Title: WEFix: Intelligent Automatic Generation of Explicit Waits for Efficient Web End-to-End Flaky Tests
Authors: Xinyue Liu, Zihe Song, Weike Fang, Wei Yang, Weihang Wang
Abstract summary: We propose WEFix, a technique that can automatically generate fix code for UI-based flakiness in web e2e testing. We evaluate the effectiveness and efficiency of WEFix against 122 web e2e flaky tests from seven popular real-world projects.
Score: 13.280540531582945
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Web end-to-end (e2e) testing evaluates the workflow of a web application. It simulates real-world user scenarios to ensure the application flows behave as expected. However, web e2e tests are notorious for being flaky, i.e., the tests can produce inconsistent results despite no changes to the code. One common type of flakiness is caused by nondeterministic execution orders between the test code and the client-side code under test. In particular, UI-based flakiness emerges as a notably prevalent and challenging issue to fix because the test code has limited knowledge about the client-side code execution. In this paper, we propose WEFix, a technique that can automatically generate fix code for UI-based flakiness in web e2e testing. The core of our approach is to leverage browser UI changes to predict the client-side code execution and generate proper wait oracles. We evaluate the effectiveness and efficiency of WEFix against 122 web e2e flaky tests from seven popular real-world projects. Our results show that WEFix dramatically reduces the overhead (from 3.7$\times$ to 1.25$\times$) while achieving a high correctness (98%).

Related papers

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [59.214178488091584]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z)
Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This study is the first empirical study on early test termination due to assertion failure. We investigated 207 versions of 6 open-source projects. Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
arXiv Detail & Related papers (2025-04-06T17:14:09Z)
Otter: Generating Tests from Issues to Validate SWE Patches [12.353105297285802]
This paper introduces Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues.
arXiv Detail & Related papers (2025-02-07T22:41:31Z)
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs. Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial tests authoring, test suite completion, and code coverage improvements. We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
Taming Timeout Flakiness: An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes. Test timeouts are one contributing factor to such flaky test failures. Test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
arXiv Detail & Related papers (2024-02-07T20:01:41Z)
Do Automatic Test Generation Tools Generate Flaky Tests? [12.813573907094074]
The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We generate tests using EvoSuite (Java) and Pynguin (Python) and execute each test 200 times. Our results show that flakiness is at least as common in generated tests as in developer-written tests.
arXiv Detail & Related papers (2023-10-08T16:44:27Z)
Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles. The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
Neural Embeddings for Web Testing [49.66745368789056]
Existing crawlers rely on app-specific, threshold-based, algorithms to assess state equivalence. We propose WEBEMBED, a novel abstraction function based on neural network embeddings and threshold-free classifiers. Our evaluation on nine web apps shows that WEBEMBED outperforms state-of-the-art techniques by detecting near-duplicates more accurately.
arXiv Detail & Related papers (2023-06-12T19:59:36Z)
Time-based Repair for Asynchronous Wait Flaky Tests in Web Testing [0.0]
Asynchronous waits are one of the most prevalent root causes of flaky tests in web applications. We propose TRaf, an automated time-based repair method for asynchronous wait flaky tests. Our analysis shows that TRaf can suggest a shorter wait time to resolve the test flakiness compared to developer-written fixes.
arXiv Detail & Related papers (2023-05-15T12:17:30Z)
MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption [69.76837484008033]
An unresolved problem in Deep Learning is the ability of neural networks to cope with domain shifts during test-time. We combine meta-learning, self-supervision and test-time training to learn to adapt to unseen test distributions. Our approach significantly improves the state-of-the-art results on the CIFAR-10-Corrupted image classification benchmark.
arXiv Detail & Related papers (2021-03-30T09:33:38Z)
Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. This paper introduces a reinforcement learning framework for carrying A/B testing in online experiments.
arXiv Detail & Related papers (2020-02-05T10:25:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.