DevMuT: Testing Deep Learning Framework via Developer Expertise-Based Mutation
- URL: http://arxiv.org/abs/2507.04360v1
- Date: Sun, 06 Jul 2025 11:48:04 GMT
- Title: DevMuT: Testing Deep Learning Framework via Developer Expertise-Based Mutation
- Authors: Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Yinglong Zou, Tao Zheng, Zhenyu Chen
- Abstract summary: DevMuT simulates developers' common operations in development and detects more diverse defects. It achieves an average improvement of at least 71.68% in the diversity of generated models. DevMuT has been deployed in the MindSpore community since December 2023.
- Score: 15.407978476058483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) frameworks are the fundamental infrastructure for various DL applications. Framework defects can cause disastrous accidents and thus require sufficient detection. In previous studies, researchers adopt DL models as test inputs, combined with mutation, to generate more diverse models. Although these studies demonstrate promising results, most detected defects are considered trivial (i.e., either treated as edge cases or ignored by the developers). To identify important bugs that matter to developers, we propose DevMuT, a novel DL framework testing method that generates models using mutation operators and constraints derived from developer expertise. DevMuT simulates developers' common operations in development and detects more diverse defects within more stages of the DL model lifecycle (e.g., model training and inference). We evaluate the performance of DevMuT on three widely used DL frameworks (i.e., PyTorch, JAX, and MindSpore) with 29 DL models from nine types of industry tasks. The experiment results show that DevMuT outperforms state-of-the-art baselines: it achieves at least a 71.68% average improvement in the diversity of generated models and a 28.20% average improvement in the legal rates of generated models. Moreover, DevMuT detects 117 defects, of which 63 are confirmed, 24 are fixed, and eight are rated as high-value by developers. Finally, DevMuT has been deployed in the MindSpore community since December 2023. These results demonstrate the effectiveness of DevMuT in detecting defects that are close to real-world scenarios and of concern to developers.
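To make the abstract's core idea concrete, here is a minimal sketch of a developer-expertise-style mutation: a layer is swapped for a variant a developer might plausibly write, under a legality constraint that preserves output shapes, and the mutant is exercised through a training step rather than inference alone. This is an illustration only, not DevMuT's implementation; the `mutate_conv_kernel` helper, the specific operator, and the sanity check are assumptions standing in for the paper's interview-derived operators and constraints.

```python
# Illustrative sketch only (not DevMuT's code): a "developer-style" mutation
# that swaps a Conv2d for a variant a developer might plausibly write, under
# a constraint that keeps the output shape legal, then runs one training step.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

def mutate_conv_kernel(model: nn.Module) -> nn.Module:
    """Replace the first Conv2d with a different kernel size, padded so the
    spatial output shape is unchanged (a hypothetical legality constraint)."""
    mutant = copy.deepcopy(model)
    for name, layer in mutant.named_modules():
        if isinstance(layer, nn.Conv2d):
            new_k = 5 if layer.kernel_size[0] == 3 else 3
            replacement = nn.Conv2d(layer.in_channels, layer.out_channels,
                                    kernel_size=new_k, padding=new_k // 2)
            # Navigate to the parent module and swap the layer in place.
            parent = mutant
            *path, leaf = name.split(".")
            for part in path:
                parent = getattr(parent, part)
            setattr(parent, leaf, replacement)
            return mutant
    return mutant

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
mutant = mutate_conv_kernel(model)

# Exercise more of the model lifecycle than inference alone: one training step.
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
optimizer = torch.optim.SGD(mutant.parameters(), lr=0.01)
loss = F.cross_entropy(mutant(x), y)
loss.backward()
optimizer.step()
assert torch.isfinite(loss), "mutant training produced a non-finite loss"
```

In DevMuT itself, the operators and constraints come from developer expertise, and behavior is compared across PyTorch, JAX, and MindSpore over multiple lifecycle stages; the single-framework finiteness check above only gestures at that differential oracle.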
Related papers
- BugScope: Learn to Find Bugs Like Human [9.05553442116139]
BugScope emulates how human auditors learn new bug patterns from representative examples and apply that knowledge during code auditing. Our evaluation on a dataset of 40 real-world bugs drawn from 21 widely used open-source projects demonstrates that BugScope achieves 87.04% precision. Further testing on large-scale open-source systems, including the Linux kernel, uncovered 141 previously unknown bugs.
arXiv Detail & Related papers (2025-07-21T14:34:01Z)
- Deep Learning Framework Testing via Model Mutation: How Far Are We? [30.292791319442404]
We revisit the defect detection ability of existing mutation-based testing methods. We identify 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
arXiv Detail & Related papers (2025-06-21T08:44:33Z)
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- An Empirical Study on Eliciting and Improving R1-like Reasoning Models [90.52239241349504]
Scaling RL training has become a central technique for implementing such reasoning models. We demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models. We also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models.
arXiv Detail & Related papers (2025-03-06T15:34:27Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z)
- CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [36.34154201748415]
Existing deep learning (DL) framework testing tools have limited coverage of bug types. We propose Citadel, a method that accelerates bug finding in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-18T01:51:16Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many of the predictive signals in the data may instead stem from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training scheme to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- NeuRI: Diversifying DNN Generation via Inductive Rule Inference [16.463237407360594]
NeuRI is a fully automated approach for generating valid and diverse deep learning models.
NeuRI improves the branch coverage of TensorFlow and PyTorch by 24% and 15%, respectively, over state-of-the-art model-level fuzzers.
arXiv Detail & Related papers (2023-02-04T23:42:07Z)
- An Empirical Study of Deep Learning Models for Vulnerability Detection [4.243592852049963]
We surveyed and reproduced 9 state-of-the-art deep learning models on 2 widely used vulnerability detection datasets.
We investigated model capabilities, training data, and model interpretation.
Our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models.
arXiv Detail & Related papers (2022-12-15T19:49:34Z)
- Finding Deep-Learning Compilation Bugs with NNSmith [20.082492391396933]
We propose a new fuzz testing approach for finding bugs in deep-learning compilers.
Our core approach uses (i) light-weight operator specifications to generate diverse yet valid models, (ii) a gradient-based search process, and (iii) differential testing to identify bugs.
We implemented this approach in NNSmith, which has found 65 new bugs in the last seven months for TVM, TensorRT, ONNXRuntime, and PyTorch. Of these, 52 have been confirmed and 44 have been fixed by maintainers. A minimal sketch of the differential-testing idea follows this entry.
arXiv Detail & Related papers (2022-07-26T17:39:51Z)
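Component (iii) of NNSmith's approach, differential testing, amounts to running the same model through two execution paths and flagging disagreements. The sketch below illustrates the general idea using PyTorch eager execution as the oracle against its `torch.compile` path; this pairing is an assumption for illustration, not NNSmith's actual harness or backends.

```python
# Generic differential-testing sketch (not NNSmith's harness): run the same
# model under two execution paths and flag output disagreements as candidate
# compiler bugs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).eval()
compiled = torch.compile(model)  # second execution path under test

x = torch.randn(64, 16)
with torch.no_grad():
    reference = model(x)      # eager execution serves as the oracle
    candidate = compiled(x)   # compiled execution under test

# A disagreement beyond numerical tolerance is a candidate bug to triage.
if not torch.allclose(reference, candidate, rtol=1e-4, atol=1e-5):
    max_err = (reference - candidate).abs().max().item()
    print(f"potential compilation bug: max abs error = {max_err:.3e}")
else:
    print("outputs agree within tolerance")
```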
- DirectDebug: Automated Testing and Debugging of Feature Models [55.41644538483948]
Variability models (e.g., feature models) are a common way to represent the variabilities and commonalities of software artifacts.
Complex and often large-scale feature models can become faulty, i.e., fail to represent the expected variability properties of the underlying software artifact.
arXiv Detail & Related papers (2021-02-11T11:22:20Z)