Automatically detecting data drift in machine learning classifiers
- URL: http://arxiv.org/abs/2111.05672v1
- Date: Wed, 10 Nov 2021 12:34:14 GMT
- Title: Automatically detecting data drift in machine learning classifiers
- Authors: Samuel Ackerman, Orna Raz, Marcel Zalmanovici, Aviad Zlotnick
- Abstract summary: We term changes that affect machine learning performance 'data drift' or 'drift'.
We propose an approach based solely on a classifier's suggested labels and its confidence in them, for alerting on data distribution or feature space changes that are likely to cause data drift.
- Score: 2.202253618096515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classifiers and other statistics-based machine learning (ML) techniques
generalize, or learn, based on various statistical properties of the training
data. The assumption underlying statistical ML's theoretical and empirical
performance guarantees is that the distribution of the training data
is representative of the production data distribution. This assumption often
breaks; for instance, the statistical distributions of the data may change. We term
changes that affect ML performance 'data drift' or 'drift'.
Many classification techniques compute a measure of confidence in their
results. This measure might not reflect the actual ML performance. A famous
example is the Panda picture that is correctly classified as such with a
confidence of about 60%, but when noise is added it is incorrectly classified
as a Gibbon with a confidence of above 99%. However, the work we report on
here suggests that a classifier's measure of confidence can be used for the
purpose of detecting data drift.
We propose an approach based solely on a classifier's suggested labels and its
confidence in them, for alerting on data distribution or feature space changes
that are likely to cause data drift. Our approach identifies degradation in
model performance and does not require labeling of production data, which is
often lacking or delayed. Our experiments with three different data sets and
classifiers demonstrate the effectiveness of this approach in detecting data
drift. This is especially encouraging as the classification itself may or may
not be correct and no model input data is required. We further explore the
statistical approach of sequential change-point tests to automatically
determine the amount of data needed in order to identify drift while
controlling the false positive rate (Type-1 error).
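To make the approach concrete, here is a minimal sketch of confidence-based drift alerting, not the authors' implementation: it assumes a scikit-learn-style classifier exposing predict_proba, takes the maximum class probability as the confidence, and compares a reference window of confidences against a production window with a two-sample Kolmogorov-Smirnov test from scipy. The window sizes, alpha level, and function names are illustrative assumptions, and the paper's sequential change-point procedure for deciding how much production data is needed is not shown.
```python
# Minimal sketch (an assumption-laden illustration, not the authors' code):
# alert on drift by nonparametrically comparing the distribution of classifier
# confidence scores on production data against a reference window collected at
# training time. No production labels and no model input features are required.
import numpy as np
from scipy.stats import ks_2samp


def confidence_scores(model, X):
    """Confidence = maximum predicted-class probability (sklearn-style model)."""
    return model.predict_proba(X).max(axis=1)


def drift_alert(reference_conf, production_conf, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test between confidence distributions.

    A small p-value means the production confidences are unlikely to come from
    the same distribution as the reference confidences, which we treat as a
    drift alert. The alpha level here is an illustrative choice.
    """
    statistic, p_value = ks_2samp(reference_conf, production_conf)
    return {"ks_statistic": statistic, "p_value": p_value, "drift": p_value < alpha}


if __name__ == "__main__":
    # Synthetic stand-ins: in practice, reference_conf comes from held-out data
    # scored at training time and production_conf from recent unlabeled traffic.
    rng = np.random.default_rng(0)
    reference_conf = rng.beta(8, 2, size=2000)   # confident, in-distribution-like
    production_conf = rng.beta(5, 3, size=500)   # less confident, drifted-like
    print(drift_alert(reference_conf, production_conf))
```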
Related papers
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (a rough sketch appears after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on.
We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z)
- Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals [8.827892752465958]
We propose to train a robust text classifier by augmenting the training data with automatically generated counterfactual data.
We show that the robust classifier makes meaningful and trustworthy predictions by emphasizing causal features and de-emphasizing non-causal features.
arXiv Detail & Related papers (2020-12-18T03:57:32Z)
- Detection of data drift and outliers affecting machine learning model performance over time [5.319802998033767]
Drift is distribution change between the training and deployment data.
We wish to detect these changes but can't measure accuracy without deployment data labels.
We instead detect drift indirectly by nonparametrically testing the distribution of model prediction confidence for changes.
arXiv Detail & Related papers (2020-12-16T20:50:12Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
- Robust Variational Autoencoder for Tabular Data with Beta Divergence [0.0]
We propose a robust variational autoencoder with mixed categorical and continuous features.
Our results on the anomaly detection application for network traffic datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-06-15T08:09:34Z)
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
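As a companion to the ATC entry above, the following is a rough sketch of Average Thresholded Confidence under the assumption of max-probability confidence scores; the function names and the threshold-selection loop are illustrative, not the published implementation. The threshold is chosen on labeled source data so that the fraction of examples whose confidence exceeds it matches the source accuracy, and target accuracy is then predicted as the fraction of unlabeled target examples above that threshold.
```python
# Rough sketch of Average Thresholded Confidence (ATC) as summarized in the
# related-papers entry above; an illustration under assumptions, not the
# published implementation. Confidence here is the max predicted probability.
import numpy as np


def learn_atc_threshold(source_conf, source_correct):
    """Pick the threshold t so that the fraction of labeled source examples with
    confidence >= t is closest to the observed source accuracy."""
    source_accuracy = np.mean(source_correct)
    candidates = np.sort(source_conf)
    # Fraction of source examples at or above each candidate threshold.
    exceed_fractions = 1.0 - np.arange(len(candidates)) / len(candidates)
    best = np.argmin(np.abs(exceed_fractions - source_accuracy))
    return candidates[best]


def predict_target_accuracy(target_conf, threshold):
    """Predicted target accuracy = fraction of unlabeled target confidences >= t."""
    return float(np.mean(target_conf >= threshold))


if __name__ == "__main__":
    # Toy data: source confidences with correctness correlated to confidence,
    # and a shifted target confidence distribution.
    rng = np.random.default_rng(1)
    source_conf = rng.uniform(0.5, 1.0, size=1000)
    source_correct = (rng.uniform(size=1000) < source_conf).astype(float)
    t = learn_atc_threshold(source_conf, source_correct)
    target_conf = rng.uniform(0.4, 1.0, size=500)
    print(f"threshold={t:.3f}, predicted target accuracy="
          f"{predict_target_accuracy(target_conf, t):.3f}")
```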