Distribution Awareness for AI System Testing
- URL: http://arxiv.org/abs/2105.02540v1
- Date: Thu, 6 May 2021 09:24:06 GMT
- Title: Distribution Awareness for AI System Testing
- Authors: David Berend
- Abstract summary: We propose a new OOD-guided testing technique which aims to generate new unseen test cases relevant to the underlying DL system task.
Our results show that this technique is able to filter up to 55.44% of error test cases on CIFAR-10 and is 10.05% more effective in enhancing robustness.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Deep Learning (DL) is continuously adopted in many safety critical
applications, its quality and reliability have started to raise concerns. Similar to
the traditional software development process, testing the DL software to
uncover its defects at an early stage is an effective way to reduce risks after
deployment. Although recent progress has been made in designing novel testing
techniques for DL software, the distribution of generated test data is not
taken into consideration. It is therefore hard to judge whether the identified
errors are indeed meaningful errors for the DL application. To address this, we
propose a new OOD-guided testing technique which aims to generate new unseen
test cases relevant to the underlying DL system task. Our results show that
this technique is able to filter up to 55.44% of error test cases on CIFAR-10
and is 10.05% more effective in enhancing robustness.
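The OOD-guided filtering step described above can be illustrated with a minimal sketch. The maximum-softmax-probability (MSP) score used here is a common stand-in for an OOD detector; the threshold value and the `filter_in_distribution` helper are illustrative assumptions, not the paper's actual implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: low values suggest the input is OOD."""
    return max(softmax(logits))

def filter_in_distribution(test_cases, threshold=0.6):
    """Keep only generated test cases the model scores as in-distribution,
    so reported errors stay relevant to the DL system's task."""
    return [name for name, logits in test_cases if msp_score(logits) >= threshold]

cases = [
    ("mutated_cat.png", [4.0, 0.1, 0.2]),    # confident prediction -> keep
    ("noise_blob.png", [0.30, 0.20, 0.25]),  # diffuse prediction -> filter out
]
kept = filter_in_distribution(cases)
```

In practice the score and threshold would come from a trained OOD detector calibrated on the training distribution; MSP is only the simplest baseline.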
Related papers
- What You See Is What You Get: Attention-based Self-guided Automatic Unit Test Generation [3.8244417073114003]
We propose Attention-based Self-guided Automatic Unit Test GenERation (AUGER) approach.
AUGER contains two stages: defect detection and error triggering.
It improves F1-score and Precision in defect detection by 4.7% to 35.3%.
It can trigger 23 to 84 more errors than state-of-the-art (SOTA) approaches in unit test generation.
arXiv Detail & Related papers (2024-12-01T14:28:48Z)
- StagedVulBERT: Multi-Granular Vulnerability Detection with a Novel Pre-trained Code Model [13.67394549308693]
This study introduces StagedVulBERT, a novel vulnerability detection framework.
Its CodeBERT-HLS component is designed to capture semantics at both the token and statement levels simultaneously.
In coarse-grained vulnerability detection, StagedVulBERT achieves an F1 score of 92.26%, marking a 6.58% improvement over the best-performing methods.
arXiv Detail & Related papers (2024-10-08T07:46:35Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- Free Lunch for Generating Effective Outlier Supervision [46.37464572099351]
We propose an ultra-effective method to generate near-realistic outlier supervision.
Our proposed BayesAug significantly reduces the false positive rate by over 12.50% compared with previous schemes.
arXiv Detail & Related papers (2023-01-17T01:46:45Z)
- Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z)
- SUPERNOVA: Automating Test Selection and Defect Prevention in AAA Video Games Using Risk Based Testing and Machine Learning [62.997667081978825]
Testing video games is an increasingly difficult task as traditional methods fail to scale with growing software systems.
We present SUPERNOVA, a system responsible for test selection and defect prevention while also functioning as an automation hub.
The direct impact of this has been a reduction of 55% or more in testing hours for an undisclosed sports game title.
arXiv Detail & Related papers (2022-03-10T00:47:46Z)
- A high performance fingerprint liveness detection method based on quality related features [66.41574316136379]
The system is tested on a highly challenging database comprising over 10,500 real and fake images.
The proposed solution proves robust on the multi-scenario dataset, correctly classifying 90% of samples overall.
arXiv Detail & Related papers (2021-11-02T21:09:39Z)
- Leveraging Uncertainty for Improved Static Malware Detection Under Extreme False Positive Constraints [21.241478970181912]
We show how ensembling and Bayesian treatments of machine learning methods for static malware detection allow for improved identification of model errors.
In particular, we improve the true positive rate (TPR) at an actual realized FPR of 1e-5 from an expected 0.69 for previous methods to 0.80 on the best performing model class on the Sophos industry scale dataset.
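Evaluating a detector at a fixed false positive budget, as this entry does, can be sketched minimally. The threshold-selection helper and the toy score lists below are illustrative assumptions, not the paper's method or data.

```python
def tpr_at_fpr(benign_scores, malicious_scores, fpr_budget):
    """Choose the score threshold that admits at most `fpr_budget`
    false positives, then measure the true positive rate there."""
    ranked = sorted(benign_scores, reverse=True)
    k = int(fpr_budget * len(ranked))           # allowed false positives
    threshold = ranked[k] if k < len(ranked) else ranked[-1] - 1.0
    hits = sum(1 for s in malicious_scores if s > threshold)
    return hits / len(malicious_scores)

benign = [0.95, 0.90] + [0.10] * 8              # 10 benign samples
malicious = [0.80, 0.70, 0.60, 0.05]            # 4 malicious samples
rate = tpr_at_fpr(benign, malicious, fpr_budget=0.2)
```

At an industrial FPR of 1e-5, as in the paper, the benign set must contain millions of samples for the threshold estimate to be meaningful; the tiny lists here only show the mechanics.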
arXiv Detail & Related papers (2021-08-09T14:30:23Z)
- Reinforcement Learning for Test Case Prioritization [0.24366811507669126]
This paper extends recent studies on applying Reinforcement Learning to optimize testing strategies.
We evaluate its ability to adapt to new environments by applying it to novel data extracted from a financial institution.
We also studied the impact of using Decision Tree (DT) Approximator as a model for memory representation.
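Learning-driven test prioritization of this kind can be illustrated with a toy value-update sketch: each test's estimated failure value moves toward its latest outcome, and likely-failing tests run first. The update rule and test names are hypothetical, not the paper's RL formulation.

```python
def update_value(value, failed, lr=0.3):
    """Move a test's estimated failure value toward the latest outcome
    (1.0 if it failed, 0.0 if it passed)."""
    return value + lr * ((1.0 if failed else 0.0) - value)

def prioritize(values):
    """Schedule tests with the highest estimated failure value first."""
    return sorted(values, key=values.get, reverse=True)

values = {"test_login": 0.0, "test_report": 0.0}
values["test_login"] = update_value(values["test_login"], failed=True)
values["test_report"] = update_value(values["test_report"], failed=False)
order = prioritize(values)
```

The paper's Decision Tree approximator would replace this flat per-test table with a model that generalizes across test-case features.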
arXiv Detail & Related papers (2020-12-18T11:08:20Z)
- NADS: Neural Architecture Distribution Search for Uncertainty Awareness [79.18710225716791]
Machine learning (ML) systems often encounter Out-of-Distribution (OoD) errors when dealing with testing data coming from a distribution different from training data.
Existing OoD detection approaches are prone to errors and sometimes even assign higher likelihoods to OoD samples.
We propose Neural Architecture Distribution Search (NADS) to identify common building blocks among all uncertainty-aware architectures.
arXiv Detail & Related papers (2020-06-11T17:39:07Z)
- Towards Characterizing Adversarial Defects of Deep Learning Software from the Lens of Uncertainty [30.97582874240214]
Adversarial examples (AEs) represent a typical and important type of defect that needs to be urgently addressed.
The intrinsic uncertainty of deep learning decisions can be a fundamental reason for their incorrect behavior.
We identify and categorize the uncertainty patterns of benign examples (BEs) and AEs, and find that while BEs and AEs generated by existing methods do follow common uncertainty patterns, some other uncertainty patterns are largely missed.
arXiv Detail & Related papers (2020-04-24T07:29:47Z)
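The uncertainty patterns discussed in the last entry can be illustrated with predictive entropy, one common uncertainty measure: benign examples tend to yield confident, low-entropy predictions, while many adversarial examples yield diffuse, high-entropy ones. The example probability vectors are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy of a softmax output; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

benign_pred = [0.97, 0.01, 0.02]  # confident: a typical benign pattern
adv_pred = [0.40, 0.35, 0.25]     # diffuse: a pattern AEs often exhibit
h_benign = predictive_entropy(benign_pred)
h_adv = predictive_entropy(adv_pred)
```

Note the paper's finding is precisely that this separation is incomplete: some AEs mimic benign low-entropy patterns, which is why a single uncertainty statistic is not a reliable defense on its own.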
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.