Time to Retrain? Detecting Concept Drifts in Machine Learning Systems
- URL: http://arxiv.org/abs/2410.09190v1
- Date: Fri, 11 Oct 2024 18:47:39 GMT
- Title: Time to Retrain? Detecting Concept Drifts in Machine Learning Systems
- Authors: Tri Minh Triet Pham, Karthikeyan Premkumar, Mohamed Naili, Jinqiu Yang
- Abstract summary: We propose a model-agnostic technique (CDSeer) for detecting concept drift in machine learning (ML) models.
Results show that CDSeer has better precision and recall compared to the state-of-the-art while requiring significantly less manual labeling.
The improved performance and ease of adoption of CDSeer are valuable in making ML systems more reliable.
- Score: 1.4499463058550683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the boom of machine learning (ML) techniques, software practitioners build ML systems to process the massive volume of streaming data for diverse software engineering tasks such as failure prediction in AIOps. Trained using historical data, such ML models encounter performance degradation caused by concept drift, i.e., data and inter-relationship (concept) changes between training and production. It is essential to use concept drift detection to monitor the deployed ML models and re-train the ML models when needed. In this work, we explore applying state-of-the-art (SOTA) concept drift detection techniques on synthetic and real-world datasets in an industrial setting. Such an industrial setting requires minimal manual effort in labeling and maximal generality in ML model architecture. We find that current SOTA semi-supervised methods not only require significant labeling effort but also only work for certain types of ML models. To overcome such limitations, we propose a novel model-agnostic technique (CDSeer) for detecting concept drift. Our evaluation shows that CDSeer has better precision and recall compared to the state-of-the-art while requiring significantly less manual labeling. We demonstrate the effectiveness of CDSeer at concept drift detection by evaluating it on eight datasets from different domains and use cases. Results from internal deployment of CDSeer on an industrial proprietary dataset show a 57.1% improvement in precision while using 99% fewer labels compared to the SOTA concept drift detection method. The performance is also comparable to the supervised concept drift detection method, which requires 100% of the data to be labeled. The improved performance and ease of adoption of CDSeer are valuable in making ML systems more reliable.
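The abstract contrasts CDSeer with a supervised detector that needs a label for every prediction. As a rough illustration of that supervised baseline idea only (this is not CDSeer, whose algorithm is not described on this page), the sketch below follows the widely used DDM scheme: track the running error rate p and its standard deviation s, remember the lowest p + s seen so far, and signal drift when the current p + s rises well above that minimum. The class name, the helper function, the thresholds (2 and 3 standard deviations), and the 30-sample warm-up are all illustrative assumptions.

```python
import math


class ErrorRateDriftDetector:
    """DDM-style supervised drift detector (illustrative sketch, not CDSeer)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_samples=30):
        self.warning_level = warning_level  # assumed: warn at 2 std devs
        self.drift_level = drift_level      # assumed: drift at 3 std devs
        self.min_samples = min_samples      # assumed warm-up length
        self.reset()

    def reset(self):
        """Forget all statistics, e.g. after the model has been retrained."""
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, is_error):
        """Feed one labeled outcome; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its standard deviation
        if self.n < self.min_samples:
            return "stable"
        # Remember the best (lowest) operating point seen so far.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


def first_drift_index(error_stream):
    """Return the 1-based position of the first 'drift' signal, or None."""
    detector = ErrorRateDriftDetector()
    for i, is_error in enumerate(error_stream, start=1):
        if detector.update(is_error) == "drift":
            return i
    return None


if __name__ == "__main__":
    # Synthetic stream: 10% errors for 500 samples, then 50% errors.
    stream = [(i % 10 == 0) if i <= 500 else (i % 2 == 0)
              for i in range(1, 1001)]
    print("first drift signalled at sample", first_drift_index(stream))
```

Because every call to `update` needs the true label, a detector of this kind carries exactly the labeling cost that the abstract says CDSeer is designed to reduce.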
Related papers
- EdgeFD: An Edge-Friendly Drift-Aware Fault Diagnosis System for
Industrial IoT [0.0]
We propose the Drift-Aware Weight Consolidation (DAWC) to mitigate the challenges posed by frequent data drift in the industrial Internet of Things (IIoT).
DAWC efficiently manages multiple data drift scenarios, minimizing the need for constant model fine-tuning on edge devices.
We have also developed a comprehensive diagnosis and visualization platform.
arXiv Detail & Related papers (2023-10-07T06:48:07Z)
- UnLoc: A Universal Localization Method for Autonomous Vehicles using
LiDAR, Radar and/or Camera Input [51.150605800173366]
UnLoc is a novel unified neural modeling approach for localization with multi-sensor input in all weather conditions.
Our method is extensively evaluated on Oxford Radar RobotCar, ApolloSouthBay and Perth-WA datasets.
arXiv Detail & Related papers (2023-07-03T04:10:55Z)
- CADM: Confusion Model-based Detection Method for Real-drift in Chunk
Data Stream [3.0885191226198785]
Concept drift detection has attracted considerable attention due to its importance in many real-world applications such as health monitoring and fault diagnosis.
We propose a new approach to detect real-drift in the chunk data stream with limited annotations based on concept confusion.
arXiv Detail & Related papers (2023-03-25T08:59:27Z)
- SECOE: Alleviating Sensors Failure in Machine Learning-Coupled IoT
Systems [0.0]
This paper proposes SECOE, a proactive approach for alleviating potentially simultaneous sensor failures.
SECOE includes a novel technique to minimize the number of models in the ensemble by harnessing the correlations among sensors.
Experiments reveal that SECOE effectively preserves prediction accuracy in the presence of sensor failures.
arXiv Detail & Related papers (2022-10-05T10:58:39Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which the model's confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Scanflow: A multi-graph framework for Machine Learning workflow
management, supervision, and debugging [0.0]
We propose a novel containerized directed graph framework to support end-to-end Machine Learning workflow management.
The framework allows defining and deploying ML in containers, tracking their metadata, checking their behavior in production, and improving the models by using both learned and human-provided knowledge.
arXiv Detail & Related papers (2021-11-04T17:01:12Z)
- Machine Learning Model Drift Detection Via Weak Data Slices [5.319802998033767]
We propose a method that utilizes feature space rules, called data slices, for drift detection.
We provide experimental indications that our method can identify when the ML model's performance is likely to change, based on changes in the underlying data.
arXiv Detail & Related papers (2021-08-11T16:55:34Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Transfer Learning without Knowing: Reprogramming Black-box Machine
Learning Models with Scarce Data and Limited Resources [78.72922528736011]
We propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box machine learning model.
Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses.
BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method.
arXiv Detail & Related papers (2020-07-17T01:52:34Z)
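The Average Thresholded Confidence (ATC) entry in the list above describes a label-free accuracy estimate that can be sketched in a few lines. The version below is a minimal reading of that one-sentence summary, assuming the confidence score is the model's maximum softmax probability and that the threshold is chosen so the fraction of labeled source examples above it equals the source accuracy. The function names and the midpoint tie-breaking are illustrative assumptions, not the paper's reference implementation.

```python
import math


def learn_atc_threshold(source_conf, source_correct):
    """Pick a threshold so that the share of source examples whose
    confidence exceeds it matches the measured source accuracy.

    source_conf:    per-example confidence scores on labeled source data
    source_correct: matching booleans (True where the prediction was right)
    """
    if len(source_conf) != len(source_correct) or not source_conf:
        raise ValueError("need equally sized, non-empty inputs")
    ranked = sorted(source_conf, reverse=True)
    k = sum(bool(c) for c in source_correct)  # examples that must lie above
    if k == 0:
        return ranked[0]       # nothing is strictly above the maximum
    if k == len(ranked):
        return -math.inf       # everything lies above the threshold
    # Midpoint between the k-th and (k+1)-th highest score (assumed choice).
    return (ranked[k - 1] + ranked[k]) / 2.0


def estimate_target_accuracy(target_conf, threshold):
    """Estimated accuracy = fraction of unlabeled target examples whose
    confidence exceeds the learned threshold."""
    if not target_conf:
        raise ValueError("need at least one target example")
    return sum(c > threshold for c in target_conf) / len(target_conf)


if __name__ == "__main__":
    src_conf = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.3]
    src_ok = [True] * 7 + [False] * 3          # 70% source accuracy
    t = learn_atc_threshold(src_conf, src_ok)
    print("threshold:", t)
    print("estimated target accuracy:",
          estimate_target_accuracy([0.9, 0.6, 0.4, 0.3], t))
```

An estimate of this kind could serve as an unlabeled drift signal: a sharp drop in estimated accuracy on recent production data suggests the model may need retraining.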
This list is automatically generated from the titles and abstracts of the papers on this site.