Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems
- URL: http://arxiv.org/abs/2311.18768v1
- Date: Thu, 30 Nov 2023 18:08:02 GMT
- Title: Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems
- Authors: Mohammad Hossein Amini, Shervin Naseri, Shiva Nejati
- Abstract summary: We investigate test flakiness in simulation-based testing of Autonomous Driving Systems (ADS).
We show that test flakiness in ADS is a common occurrence and can significantly impact the test results obtained by randomized algorithms.
Our machine learning (ML) classifiers effectively identify flaky ADS tests using only a single test run.
- Score: 2.291478393584594
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Simulators are widely used to test Autonomous Driving Systems (ADS), but
their potential flakiness can lead to inconsistent test results. We investigate
test flakiness in simulation-based testing of ADS by addressing two key
questions: (1) How do flaky ADS simulations impact automated testing that
relies on randomized algorithms? and (2) Can machine learning (ML) effectively
identify flaky ADS tests while decreasing the required number of test reruns?
Our empirical results, obtained from two widely-used open-source ADS simulators
and five diverse ADS test setups, show that test flakiness in ADS is a common
occurrence and can significantly impact the test results obtained by randomized
algorithms. Further, our ML classifiers effectively identify flaky ADS tests
using only a single test run, achieving F1-scores of 85%, 82%, and 96% for
three different ADS test setups. Our classifiers significantly outperform our
non-ML baseline, which requires executing tests at least twice, by 31%, 21%,
and 13% in F1-score performance, respectively. We conclude with a discussion
on the scope, implications, and limitations of our study. We provide our
complete replication package in a GitHub repository.
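The abstract contrasts a rerun-based baseline (execute each test at least twice and flag disagreement) with ML classifiers that predict flakiness from a single run. The sketch below is a minimal illustration of that idea, not the authors' pipeline: the synthetic data, the feature semantics (e.g., per-run simulation measurements), and the random-forest choice are all assumptions for demonstration; the actual method is in the replication package.

```python
# Minimal sketch of flaky-ADS-test identification, contrasting the two
# strategies from the abstract. Data and model choice are illustrative
# assumptions, not the authors' actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def rerun_baseline(run_verdicts: list[bool]) -> bool:
    """Non-ML baseline: a test is flagged as flaky when two (or more)
    executions of the same scenario disagree on pass/fail."""
    return len(set(run_verdicts)) > 1

# Hypothetical per-run features extracted from a *single* simulation,
# e.g. minimum distance to obstacles, speed variance, lane deviation.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                      # 500 runs, 3 features
y = X[:, 0] + 0.5 * rng.normal(size=500) > 0       # synthetic flaky labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"F1 on held-out runs: {f1_score(y_te, clf.predict(X_te)):.2f}")
```

The point of the comparison in the paper is cost: the baseline needs at least two simulation runs per test, while the classifier consumes features from only one.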
Related papers
- Fine-grained Testing for Autonomous Driving Software: a Study on Autoware with LLM-driven Unit Testing [12.067489008051208]
We present the first study on testing, specifically unit testing, for autonomous driving systems (ADS) source code.
We analyze both human-written test cases and those generated by large language models (LLMs).
We propose AwTest-LLM, a novel approach to enhance test coverage and improve test case pass rates across Autoware packages.
arXiv Detail & Related papers (2025-01-16T22:36:00Z)
- DriveTester: A Unified Platform for Simulation-Based Autonomous Driving Testing [24.222344794923558]
DriveTester is a unified simulation-based testing platform built on Apollo.
It provides a consistent and reliable environment, integrates a lightweight traffic simulator, and incorporates various state-of-the-art ADS testing techniques.
arXiv Detail & Related papers (2024-12-17T08:24:05Z)
- Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
- LLM-Powered Test Case Generation for Detecting Tricky Bugs [30.82169191775785]
AID generates test inputs and oracles targeting plausibly correct programs.
We evaluate AID on two large-scale datasets with tricky bugs: TrickyBugs and EvalPlus.
The evaluation results show that the recall, precision, and F1 score of AID outperform the state-of-the-art by up to 1.80x, 2.65x, and 1.66x, respectively.
arXiv Detail & Related papers (2024-04-16T06:20:06Z)
- Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study [61.64685376882383]
Counterfactual learning to rank (CLTR) has attracted extensive attention in the IR community for its ability to leverage massive logged user interaction data to train ranking models.
This paper investigates the robustness of existing CLTR models in complex and diverse situations.
We find that the DLA models and IPS-DCM show better robustness under various simulation settings than IPS-PBM and PRS with offline propensity estimation.
arXiv Detail & Related papers (2024-04-04T10:54:38Z)
- Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures.
We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
- AutoML Two-Sample Test [13.468660785510945]
We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power.
We provide an implementation of the AutoML two-sample test in the Python package autotst.
arXiv Detail & Related papers (2022-06-17T15:41:07Z)
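The AutoML two-sample test entry above describes using the mean discrepancy of a learned witness function as a test statistic. Below is a toy, numpy-only sketch of that general idea (a ridge-regression witness calibrated by a permutation test); it is not the autotst package, and the split/regularization details are assumptions.

```python
# Illustrative witness-function two-sample test (not the autotst package):
# fit a witness on a training split, use its mean discrepancy on a held-out
# split as the test statistic, and calibrate with a permutation test.
import numpy as np

def witness_two_sample_test(P, Q, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Split each sample: half to fit the witness, half to test.
    P_tr, P_te = P[: len(P) // 2], P[len(P) // 2 :]
    Q_tr, Q_te = Q[: len(Q) // 2], Q[len(Q) // 2 :]
    # Ridge-regression witness: predict +1 for P points, -1 for Q points.
    X = np.vstack([P_tr, Q_tr])
    y = np.concatenate([np.ones(len(P_tr)), -np.ones(len(Q_tr))])
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)
    # Test statistic: mean witness discrepancy on held-out data.
    scores = np.concatenate([P_te @ w, Q_te @ w])
    labels = np.concatenate([np.ones(len(P_te)), -np.ones(len(Q_te))])
    observed = scores[labels > 0].mean() - scores[labels < 0].mean()
    # Permutation null: reshuffle which held-out points came from P vs Q.
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(labels)
        null[i] = scores[perm > 0].mean() - scores[perm < 0].mean()
    return (1 + np.sum(null >= observed)) / (1 + n_perm)  # one-sided p-value

rng = np.random.default_rng(1)
p_val = witness_two_sample_test(rng.normal(0.0, 1.0, (200, 5)),
                                rng.normal(0.5, 1.0, (200, 5)))
print(f"p-value: {p_val:.3f}")
```

Under the null hypothesis of equal distributions the held-out group labels are exchangeable, which is what justifies the permutation calibration.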
- TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- Digital Twins Are Not Monozygotic -- Cross-Replicating ADAS Testing in Two Industry-Grade Automotive Simulators [13.386879259549305]
We show that SBST can be used to effectively and efficiently generate critical test scenarios in two simulators.
We find that executing the same test scenarios in the two simulators leads to notable differences in the details of the test outputs.
arXiv Detail & Related papers (2020-12-12T14:00:33Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
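Dorfman's classic observation, referenced in the last entry, can be verified with a few lines of arithmetic: for prevalence p and group size g, two-stage pooled testing costs 1/g + 1 - (1-p)^g expected tests per person, versus 1 for individual testing. The sketch below is the standard textbook calculation, not the paper's Bayesian sequential algorithm, and the 2% prevalence is an illustrative value.

```python
# Expected tests per person under Dorfman two-stage group testing:
# one pooled test per group of g, plus g individual retests when the
# pool is positive (probability 1 - (1 - p)^g).
def tests_per_person(p: float, g: int) -> float:
    return 1.0 / g + 1.0 - (1.0 - p) ** g

p = 0.02  # 2% prevalence (illustrative value, not from the paper)
best_g = min(range(2, 51), key=lambda g: tests_per_person(p, g))
print(f"optimal group size: {best_g}, "
      f"expected tests/person: {tests_per_person(p, best_g):.3f}")
# At p = 0.02 the optimum is g = 8, about 0.27 tests per person vs. 1.0
# for individual testing.
```

This noiseless setting is the baseline; the paper's contribution is group testing algorithms that remain effective when individual test outcomes are noisy.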