Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering
- URL: http://arxiv.org/abs/2411.12137v1
- Date: Tue, 19 Nov 2024 00:28:20 GMT
- Title: Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering
- Authors: Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh
- Abstract summary: Deep learning (DL) systems are prone to bugs from many sources, including training data.
Existing literature suggests that bugs in training data are highly prevalent.
We investigate three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based.
- Score: 13.17302533571231
- Abstract: Deep learning (DL) techniques have achieved significant success in various software engineering tasks (e.g., code completion by Copilot). However, DL systems are prone to bugs from many sources, including training data. Existing literature suggests that bugs in training data are highly prevalent, but little research has focused on understanding their impacts on the models used in software engineering tasks. In this paper, we address this research gap through a comprehensive empirical investigation focused on three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based. Using state-of-the-art baselines, we compare the models trained on clean datasets with those trained on datasets with quality issues and without proper preprocessing. By analysing the gradients, weights, and biases from neural networks under training, we identify the symptoms of data quality and preprocessing issues. Our analysis reveals that quality issues in code data cause biased learning and gradient instability, whereas problems in text data lead to overfitting and poor generalisation of models. On the other hand, quality issues in metric data result in exploding gradients and model overfitting, and inadequate preprocessing exacerbates these effects across all three data types. Finally, we demonstrate the validity and generalizability of our findings using six new datasets. Our research provides a better understanding of the impact and symptoms of data bugs in software engineering datasets. Practitioners and researchers can leverage these findings to develop better monitoring systems and data-cleaning methods to help detect and resolve data bugs in deep learning systems.
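The paper's core diagnostic signal is the behaviour of gradients, weights, and biases during training. As a rough illustration of that kind of monitoring (not the authors' actual instrumentation; the model, loader, and thresholds below are placeholders), a PyTorch training loop can log per-step gradient norms and flag the exploding-gradient and instability symptoms the abstract describes:

```python
# Minimal sketch of training-time gradient monitoring; the model, data
# loader, and thresholds are illustrative placeholders, not the paper's
# actual instrumentation.
import torch
import torch.nn as nn

def gradient_norms(model: nn.Module) -> dict:
    """Collect the L2 norm of each parameter's gradient after backward()."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

def train_with_monitoring(model, loader, optimizer, loss_fn,
                          explode_thresh=1e3, vanish_thresh=1e-7):
    for step, (x, y) in enumerate(loader):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        total = sum(gradient_norms(model).values())
        # Symptoms the paper associates with metric-data issues (exploding
        # gradients) and code-data issues (gradient instability).
        if total > explode_thresh:
            print(f"step {step}: possible exploding gradients (norm={total:.2e})")
        elif total < vanish_thresh:
            print(f"step {step}: possible vanishing gradients (norm={total:.2e})")
        optimizer.step()
```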
Related papers
- AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
We also present a generic framework for detecting various quality anomalies using AI models.
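As a toy illustration of the weighted scoring idea (the quality dimensions and weights below are invented, not the thesis's actual rubric):

```python
# Hypothetical weighted data-quality score; the dimensions and weights are
# invented for illustration, not the framework's actual rubric.
def quality_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension quality metrics (each in [0, 1]) into one score."""
    return sum(metrics[k] * w for k, w in weights.items()) / sum(weights.values())

score = quality_score(
    metrics={"completeness": 0.92, "accuracy": 0.85, "consistency": 0.78},
    weights={"completeness": 0.5, "accuracy": 0.3, "consistency": 0.2},
)
print(f"overall quality: {score:.3f}")  # weighted mean of the three dimensions
```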
arXiv Detail & Related papers (2024-05-06T21:36:45Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- An Effective Data-Driven Approach for Localizing Deep Learning Faults [20.33411443073181]
We propose a novel data-driven approach that leverages model features to learn problem patterns.
Our methodology automatically links bug symptoms to their root causes, without the need for manually crafted mappings.
Our results demonstrate that our technique can effectively detect and diagnose different bug types.
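The general recipe of learning a mapping from training-run features to root causes can be sketched as follows; the feature names, labels, and classifier choice are assumptions for illustration, not the paper's actual pipeline:

```python
# Illustrative sketch of learning a symptom -> root-cause mapping from
# model features; the features, labels, and classifier are assumptions,
# not the paper's actual pipeline.
from sklearn.ensemble import RandomForestClassifier

# Each row: features extracted from a training run, e.g.
# [mean gradient norm, loss slope, train/validation accuracy gap].
X = [[1.2e3, 0.02, 0.10],
     [3.0e-1, -0.10, 0.05],
     [5.0e0, -0.01, 0.45]]
y = ["exploding_gradients", "healthy", "overfitting"]  # root-cause labels

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[9.0e2, 0.01, 0.12]]))  # diagnose a new run's features
```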
arXiv Detail & Related papers (2023-07-18T03:28:39Z)
- Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks: image classification, object detection, and semantic segmentation.
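As a minimal example of one common augmentation family covered by such surveys (the specific transforms are an illustrative choice, not the survey's recommendation):

```python
# A minimal image-classification augmentation pipeline; the transform
# choices are an illustrative sample of the techniques surveyed.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # geometric
    transforms.RandomCrop(32, padding=4),                   # translation via padded crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # photometric
    transforms.ToTensor(),
])
# Applied per training sample, e.g.: augmented = train_transform(pil_image)
```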
arXiv Detail & Related papers (2023-01-07T11:37:32Z)
- Towards Robust Dataset Learning [90.2590325441068]
We propose a principled, tri-level optimization to formulate the robust dataset learning problem.
Under an abstraction model that characterizes robust vs. non-robust features, the proposed method provably learns a robust dataset.
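A schematic of how such a tri-level objective might be written, with an outer level over the learned dataset, a middle level training a model on it, and an inner level over worst-case perturbations (the paper's exact formulation may differ):

```latex
% Schematic tri-level objective (the paper's exact formulation may differ):
% outer level over the learned dataset D', middle level training theta on D',
% inner level over worst-case perturbations delta.
\min_{\mathcal{D}'} \;
  \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \max_{\|\delta\|\le\epsilon}
        \ell\big(f_{\theta^*(\mathcal{D}')}(x+\delta),\, y\big) \Big]
\quad \text{s.t.} \quad
\theta^*(\mathcal{D}') =
  \operatorname*{arg\,min}_{\theta} \;
  \mathbb{E}_{(x',y')\sim\mathcal{D}'} \big[ \ell(f_\theta(x'),\, y') \big]
```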
arXiv Detail & Related papers (2022-11-19T17:06:10Z)
- Graph Neural Networks with Trainable Adjacency Matrices for Fault Diagnosis on Multivariate Sensor Data [69.25738064847175]
The behaviour of the signals from each sensor must be considered separately, taking into account their correlations and hidden relationships with one another.
The graph nodes can be represented as data from the different sensors, and the edges can display the influence of these data on each other.
The authors propose constructing the graph during the training of the graph neural network, which allows models to be trained on data where the dependencies between sensors are not known in advance.
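One simple way to parameterise such a trainable adjacency matrix (an illustrative choice, not necessarily the paper's exact architecture) is a learned logits matrix normalised row-wise:

```python
# Sketch of a graph layer with a trainable adjacency matrix over sensors;
# the row-softmax parameterisation is an illustrative choice, not
# necessarily the paper's exact architecture.
import torch
import torch.nn as nn

class TrainableAdjacencyLayer(nn.Module):
    def __init__(self, num_sensors: int, in_dim: int, out_dim: int):
        super().__init__()
        # Learned pairwise influence between sensors.
        self.adj_logits = nn.Parameter(torch.randn(num_sensors, num_sensors))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_sensors, in_dim)
        adj = torch.softmax(self.adj_logits, dim=-1)   # soft adjacency, rows sum to 1
        x = torch.einsum("ij,bjf->bif", adj, x)        # aggregate neighbour features
        return torch.relu(self.linear(x))

layer = TrainableAdjacencyLayer(num_sensors=8, in_dim=16, out_dim=32)
out = layer(torch.randn(4, 8, 16))  # -> shape (4, 8, 32)
```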
arXiv Detail & Related papers (2022-10-20T11:03:21Z)
- Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
- Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective [16.480530590466472]
Data-centric AI practices are now becoming mainstream.
Many datasets in the real world are small, dirty, biased, and even poisoned.
For data quality, we study data validation and data cleaning techniques.
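A hand-rolled example of the kind of validation check such techniques automate (the column names and rules are invented for illustration):

```python
# Hand-rolled validation checks of the kind such techniques automate;
# the column names and rules are invented for illustration.
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return human-readable descriptions of detected quality issues."""
    issues = []
    if df["age"].isna().any():
        issues.append("missing values in 'age'")
    if (df["age"] < 0).any():
        issues.append("negative values in 'age'")
    if df.duplicated().any():
        issues.append("duplicate rows")
    return issues

df = pd.DataFrame({"age": [25, -3, None], "label": [0, 1, 1]})
print(validate(df))  # -> ["missing values in 'age'", "negative values in 'age'"]
```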
arXiv Detail & Related papers (2021-12-13T03:57:36Z)
- Data Curation and Quality Assurance for Machine Learning-based Cyber Intrusion Detection [1.0276024900942873]
This article first summarizes existing machine learning-based intrusion detection systems and the datasets used for building these systems.
We then evaluate the quality of the 11 datasets against the quality dimensions proposed in this paper, to determine the characteristics that a host-based intrusion detection system (HIDS) dataset should possess in order to yield the best possible results.
The experimental results show that BERT and GPT were the best-performing algorithms for HIDS on all of the datasets.
arXiv Detail & Related papers (2021-05-20T21:31:46Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
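A minimal sketch of the suggested clean split, assuming hypothetical `site` and `date` columns and an arbitrary cut-off date:

```python
# Sketch of the suggested clean split: no news source appears in both
# train and test, and test articles post-date the training period.
# The `site`/`date` columns and cut-off are illustrative assumptions.
import pandas as pd

def clean_split(df: pd.DataFrame, test_sites: set[str], cutoff: str):
    df = df.assign(date=pd.to_datetime(df["date"]))
    train = df[~df["site"].isin(test_sites) & (df["date"] < cutoff)]
    test = df[df["site"].isin(test_sites) & (df["date"] >= cutoff)]
    return train, test

# e.g. train, test = clean_split(articles, {"example-news.com"}, "2020-01-01")
```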
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
- Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise [21.491392581672198]
We present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study.
We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER).
We demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary efforts.
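For reference, the Bayes error rate is the error of the Bayes-optimal classifier, i.e. the irreducible error of the task itself; a standard way to write it is:

```latex
% Bayes error rate: the error of the Bayes-optimal classifier, i.e. the
% irreducible error of the task itself.
\mathrm{BER} \;=\; \mathbb{E}_{x \sim p(x)}\!\left[\, 1 - \max_{y \in \mathcal{Y}} p(y \mid x) \,\right]
```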
arXiv Detail & Related papers (2020-10-16T14:21:19Z)