Deep Learning Framework Testing via Model Mutation: How Far Are We?
- URL: http://arxiv.org/abs/2506.17638v2
- Date: Mon, 07 Jul 2025 16:10:52 GMT
- Title: Deep Learning Framework Testing via Model Mutation: How Far Are We?
- Authors: Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Zhiyuan Peng, Peiran Yang, Ruixiang Qian, Shaoyu Yang, Zhenyu Chen
- Abstract summary: We revisit the defect detection ability of existing mutation-based testing methods. We identify 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
- Score: 30.292791319442404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Learning (DL) frameworks are a fundamental component of DL development. Therefore, the detection of DL framework defects is important and challenging. As one of the most widely adopted DL testing techniques, model mutation has recently gained significant attention. In this study, we revisit the defect detection ability of existing mutation-based testing methods and investigate the factors that influence their effectiveness. To begin with, we reviewed existing methods and observed that many of them mutate DL models (e.g., changing their parameters) without any customization, ignoring the unique challenges in framework testing. Another issue with these methods is their limited effectiveness, characterized by a high rate of false positives caused by illegal mutations arising from the use of generic, non-customized mutation operators. Moreover, we tracked the defects identified by these methods and discovered that most of them were ignored by developers. Motivated by these observations, we investigate the effectiveness of existing mutation-based testing methods in detecting important defects that have been authenticated by framework developers. We begin by collecting defect reports from three popular frameworks and classifying them based on framework developers' ratings to build a comprehensive dataset. We then perform an in-depth analysis to uncover valuable insights. Based on our findings, we propose optimization strategies to address the shortcomings of existing approaches. Following these optimizations, we identified seven new defects, four of which were confirmed by developers as high-priority issues, with three resolved. In summary, we identified 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
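To make the idea of model mutation concrete, below is a minimal sketch (not the paper's tooling) of a parameter-level mutation operator together with a simple legality check of the kind the abstract alludes to when discussing illegal mutations and false positives. PyTorch is assumed as the framework under test, and the helper names `mutate_weights` and `is_legal` are hypothetical.

```python
# Minimal sketch of weight-level model mutation with a legality filter.
# Assumes PyTorch; helper names are illustrative, not from the paper.
import copy
import torch
import torch.nn as nn

def mutate_weights(model: nn.Module, scale: float = 0.1, seed: int = 0) -> nn.Module:
    """Return a copy of `model` with Gaussian noise added to its parameters."""
    torch.manual_seed(seed)
    mutant = copy.deepcopy(model)
    with torch.no_grad():
        for p in mutant.parameters():
            p.add_(scale * torch.randn_like(p))
    return mutant

def is_legal(mutant: nn.Module, x: torch.Tensor) -> bool:
    """Reject mutants whose outputs are NaN/Inf; such 'illegal' mutants tend to
    produce false positives rather than exposing framework defects."""
    with torch.no_grad():
        y = mutant(x)
    return bool(torch.isfinite(y).all())

# Usage: run the original model and a legal mutant on the same input and
# compare their outputs (in practice, across framework versions or backends).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(2, 8)
mutant = mutate_weights(model)
if is_legal(mutant, x):
    diff = (model(x) - mutant(x)).abs().max().item()
    print(f"max output difference after mutation: {diff:.4f}")
```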
Related papers
- Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes [16.451488374845407]
We present a novel framework addressing a critical vulnerability in Large Language Models (LLMs). This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research.
arXiv Detail & Related papers (2025-07-25T10:34:51Z) - It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective [14.271145160443462]
VulTegra compares scratch-trained and pre-trained DL models for vulnerability detection. State-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges.
arXiv Detail & Related papers (2025-07-13T08:02:56Z) - DevMuT: Testing Deep Learning Framework via Developer Expertise-Based Mutation [15.407978476058483]
DevMuT simulates developers' common operations in development and detects more diverse defects. It can achieve at least a 71.68% improvement on average in the diversity of generated models. DevMuT has been deployed in the MindSpore community since December 2023.
arXiv Detail & Related papers (2025-07-06T11:48:04Z) - A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions [9.141981611891715]
DISTINCT is a Description-guided, branch-consistency analysis framework. It transforms Large Language Model (LLM)-based generators into fault-aware test generators. It achieves an average improvement of 14.64% in Compilation Success Rate (CSR) and 6.66% in Passing Rate (PR).
arXiv Detail & Related papers (2025-06-09T07:05:48Z) - Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions: Is the causal effect positive or negative? How severe must assumption violations be to overturn this conclusion? We apply our framework to the Project STAR study, which investigates the effect of classroom size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z) - Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process [67.99194145865165]
We modify the AnyRes structure of the LLaVA model to provide the potential anomalous areas identified by existing IAD models to the LMMs. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. We present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and the manufacturing process.
arXiv Detail & Related papers (2025-03-17T13:56:57Z) - On the Within-class Variation Issue in Alzheimer's Disease Detection [60.08015780474457]
Alzheimer's Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without. In this work, we found that using a sample score estimator can generate sample-specific soft scores that align with cognitive scores. We propose two simple yet effective methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe).
arXiv Detail & Related papers (2024-09-22T02:06:05Z) - An Exploratory Study on Using Large Language Models for Mutation Testing [32.91472707292504]
Large Language Models (LLMs) have shown great potential in code-related tasks, but their utility in mutation testing remains unexplored.
This paper investigates the performance of LLMs in generating effective mutations with respect to their usability, fault detection potential, and relationship with real bugs.
We find that compared to existing approaches, LLMs generate more diverse mutations that are behaviorally closer to real bugs.
arXiv Detail & Related papers (2024-06-14T08:49:41Z) - Semi-supervised Anomaly Detection via Adaptive Reinforcement Learning-Enabled Method with Causal Inference for Sensor Signals [15.249261198557218]
Semi-supervised anomaly detection for sensor signals is critical in ensuring system reliability in smart manufacturing.
This paper innovatively constructs a counterfactual causal reinforcement learning model, termed Triple-Assisted Causal Reinforcement Learning Anomaly Detector (Tri-CRLAD).
Experimental results across seven diverse sensor signal datasets demonstrate that Tri-CRLAD outperforms nine state-of-the-art baseline methods.
arXiv Detail & Related papers (2024-05-11T06:10:05Z) - Unified Uncertainty Estimation for Cognitive Diagnosis Models [70.46998436898205]
We propose a unified uncertainty estimation approach for a wide range of cognitive diagnosis models.
We decompose the uncertainty of diagnostic parameters into data aspect and model aspect.
Our method is effective and can provide useful insights into the uncertainty of cognitive diagnosis.
arXiv Detail & Related papers (2024-03-09T13:48:20Z) - Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets.
We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z) - A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [76.34411067299331]
Large language models often 'hallucinate', which critically hampers their reliability.
We propose an approach that actively detects and mitigates hallucinations during the generation process.
We show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average.
arXiv Detail & Related papers (2023-07-08T14:25:57Z) - On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z)