An Empirical Study of the Realism of Mutants in Deep Learning
- URL: http://arxiv.org/abs/2512.16741v1
- Date: Thu, 18 Dec 2025 16:37:50 GMT
- Title: An Empirical Study of the Realism of Mutants in Deep Learning
- Authors: Zaheed Ahmed, Philip Makedonski, Jens Grabowski
- Abstract summary: This study presents the first empirical comparison of pre-training and post-training mutation approaches in deep learning. Results show that pre-training mutants exhibit consistently stronger coupling and higher behavioral similarity to real faults than post-training mutants.
- Score: 0.34410212782758043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mutation analysis is a well-established technique for assessing test quality in the traditional software development paradigm by injecting artificial faults into programs. Its application to deep learning (DL) has expanded beyond classical testing to support tasks such as fault localization, repair, data generation, and model robustness evaluation. The core assumption is that mutants behave similarly to real faults, an assumption well established in traditional software systems but largely unverified for DL. This study presents the first empirical comparison of pre-training and post-training mutation approaches in DL with respect to realism. We introduce a statistical framework to quantify their coupling strength and behavioral similarity to real faults using publicly available bug datasets: CleanML, DeepFD, DeepLocalize, and defect4ML. Mutants are generated using state-of-the-art tools representing both approaches. Results show that pre-training mutants exhibit consistently stronger coupling and higher behavioral similarity to real faults than post-training mutants, indicating greater realism. However, the substantial computational cost of pre-training mutation underscores the need for more effective post-training operators that match or exceed the realism demonstrated by pre-training mutants.
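The coupling and behavioral-similarity notions in the abstract can be illustrated with a minimal sketch. This is not the paper's actual statistical framework; the function names, the toy data, and the choice of Jaccard similarity over "kill vectors" (the set of tests that detect a fault) are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of failing-test IDs (kill vectors)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_coupled(real_fault_kills, mutant_kills):
    """A mutant is coupled to a real fault if at least one test that
    detects the real fault also detects the mutant."""
    return bool(real_fault_kills & mutant_kills)

# Toy kill data: IDs of tests that fail for each artifact (hypothetical).
real_fault = {"t1", "t3", "t4"}
pre_training_mutant = {"t1", "t3"}   # e.g. a fault injected before training
post_training_mutant = {"t7"}        # e.g. a weight-level change after training

print(is_coupled(real_fault, pre_training_mutant))        # True
print(is_coupled(real_fault, post_training_mutant))       # False
print(round(jaccard(real_fault, pre_training_mutant), 2)) # 0.67
```

Under this kind of measure, a mutant whose kill vector overlaps heavily with a real fault's is both coupled and behaviorally similar to it, which is the direction in which the study compares pre-training and post-training mutants.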
Related papers
- Physics-Informed Gaussian Process Regression for the Constitutive Modeling of Concrete: A Data-Driven Improvement to Phenomenological Models [15.576831245374906]
This work develops a physics-informed framework that retains the modular elastoplastic structure of the Karagozian & Case concrete model. It replaces the empirical failure surface with a constrained Gaussian Process Regression surrogate that can be learned directly from experimentally accessible observables. Results show that an unconstrained GPR interpolates well near training conditions but deteriorates and violates essential physical constraints under extrapolation.
arXiv Detail & Related papers (2026-01-06T19:09:40Z) - How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z) - WITNESS: A lightweight and practical approach to fine-grained predictive mutation testing [22.980743296712856]
WITNESS is a new fine-grained predictive mutation testing approach. It uses lightweight classical machine learning models for training and prediction. It consistently achieves state-of-the-art predictive performance across different scenarios.
arXiv Detail & Related papers (2025-11-15T02:38:00Z) - Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z) - Robust Molecular Property Prediction via Densifying Scarce Labeled Data [53.24886143129006]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.
arXiv Detail & Related papers (2025-06-13T15:27:40Z) - Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments [5.5855749614100825]
This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.
arXiv Detail & Related papers (2025-05-25T23:17:47Z) - Demystifying amortized causal discovery with transformers [21.058343547918053]
Supervised learning approaches for causal discovery from observational data often achieve competitive performance. In this work, we investigate CSIvA, a transformer-based model promising to train on synthetic data and transfer to real data. We bridge the gap with existing identifiability theory and show that constraints on the training data distribution implicitly define a prior on the test observations.
arXiv Detail & Related papers (2024-05-27T08:17:49Z) - Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study [71.04084063541777]
Counterfactual learning to rank has attracted extensive attention in the IR community. Models can be theoretically unbiased when the user behavior assumption is correct and the propensity estimation is accurate. Their effectiveness is usually empirically evaluated via simulation-based experiments due to a lack of widely available, large-scale, real click logs.
arXiv Detail & Related papers (2024-04-04T10:54:38Z) - Variance of ML-based software fault predictors: are we really improving fault prediction? [0.3222802562733786]
We experimentally analyze the variance of a state-of-the-art fault prediction approach.
We observed a maximum variance of 10.10% in terms of the per-class accuracy metric.
arXiv Detail & Related papers (2023-10-26T09:31:32Z) - TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose Two-WIng NormliSation (TWINS), a novel statistics-based fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z) - Modeling Uncertain Feature Representation for Domain Generalization [49.129544670700525]
We show that our method consistently improves the network generalization ability on multiple vision tasks.
Our methods are simple yet effective and can be readily integrated into networks without additional trainable parameters or loss constraints.
arXiv Detail & Related papers (2023-01-16T14:25:02Z) - Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.