Pre-treatment of outliers and anomalies in plant data: Methodology and case study of a Vacuum Distillation Unit
- URL: http://arxiv.org/abs/2106.14641v1
- Date: Thu, 17 Jun 2021 11:17:29 GMT
- Title: Pre-treatment of outliers and anomalies in plant data: Methodology and case study of a Vacuum Distillation Unit
- Authors: Kamil Oster, Stefan Güttel, Jonathan L. Shapiro, Lu Chen, Megan Jobson
- Abstract summary: Two types of outliers were considered: short-term outliers (erroneous data, noise) and long-term outliers (e.g. malfunctioning for longer periods).
We have shown that the piecewise 3$\sigma$ method offers a better approach to short-term outlier detection than the 3$\sigma$ method applied to the entire time series.
- Score: 5.728037880354686
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Data pre-treatment plays a significant role in improving data quality, thus allowing extraction of accurate information from raw data. One commonly used data pre-treatment technique is outlier detection. The so-called 3${\sigma}$ method is common practice for identifying outliers. As shown in the manuscript, it does not identify all outliers, resulting in possible distortion of the overall statistics of the data. This problem can have a significant impact on further data analysis and can lead to a reduction in the accuracy of predictive models. There is a plethora of techniques for outlier detection; however, aside from theoretical work, they all require case-study work. Two types of outliers were considered: short-term outliers (erroneous data, noise) and long-term outliers (e.g. malfunctioning for longer periods). The data used were taken from the vacuum distillation unit (VDU) of an Asian refinery and included 40 physical sensors (temperature, pressure and flow rate). We used a modified method for 3${\sigma}$ thresholds to identify the short-term outliers: sensor data are divided into chunks determined by change points, and 3${\sigma}$ thresholds are calculated within each chunk, which represents a near-normal distribution. We have shown that this piecewise 3${\sigma}$ method offers a better approach to short-term outlier detection than the 3${\sigma}$ method applied to the entire time series. Nevertheless, it does not perform well for long-term outliers (which can represent another state in the data). In this case, we used principal component analysis (PCA) with Hotelling's $T^2$ statistics to identify the long-term outliers. The results obtained with PCA were then subjected to the DBSCAN clustering method. The outliers (which were visually obvious and correctly detected by the PCA method) were also correctly identified by DBSCAN, which supports the consistency and accuracy of the PCA method.
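To make the piecewise 3${\sigma}$ idea from the abstract concrete, below is a minimal Python sketch (not the authors' code). A sensor series is split at given change points, and within each chunk a point is flagged as a short-term outlier when it lies outside the chunk mean plus or minus three chunk standard deviations. The function name and toy data are illustrative, and the change-point detection step itself is assumed to be done elsewhere.

```python
import numpy as np

def piecewise_3sigma_outliers(x, change_points):
    """Boolean mask of short-term outliers for a 1-D sensor series.

    x             -- array of sensor readings
    change_points -- sorted indices at which a new chunk starts
    """
    x = np.asarray(x, dtype=float)
    mask = np.zeros(x.size, dtype=bool)
    boundaries = [0, *change_points, x.size]
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        chunk = x[start:stop]
        if chunk.size < 2:
            continue
        mu, sigma = chunk.mean(), chunk.std()
        # Flag points outside mean +/- 3*std computed within this chunk only.
        mask[start:stop] = np.abs(chunk - mu) > 3.0 * sigma
    return mask

# Example: a series with a level shift at index 500 and one spike per regime.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])
signal[100] += 8.0
signal[700] -= 8.0
flags = piecewise_3sigma_outliers(signal, change_points=[500])
print(np.flatnonzero(flags))  # includes 100 and 700 (plus chance exceedances)
# A global 3-sigma threshold would miss both spikes here, because the level
# shift inflates the overall standard deviation of the series.
```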
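For the long-term outliers, a similarly minimal sketch (again not the authors' code) illustrates the PCA route: project the standardized sensor matrix onto a few principal components, compute Hotelling's $T^2 = \sum_i t_i^2 / \lambda_i$ over the retained scores $t_i$ with component variances $\lambda_i$, flag rows above an F-distribution control limit, and cross-check the flagged rows against a DBSCAN clustering of the same scores. The toy data, the number of components, the 99% limit and the DBSCAN parameters are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.stats import f
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Toy stand-in for the 40-sensor matrix: one normal operating regime plus a
# block of rows drawn from a shifted regime (a long-term anomaly).
X = rng.normal(0.0, 1.0, size=(1000, 40))
X[800:840] += 4.0

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=5).fit(Z)      # number of retained components: assumed
scores = pca.transform(Z)

# Hotelling's T^2: squared scores weighted by inverse component variances.
t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)

# 99% control limit based on the F distribution (a common, not unique, choice).
n, k = Z.shape[0], scores.shape[1]
limit = k * (n - 1) / (n - k) * f.ppf(0.99, k, n - k)
t2_flags = t2 > limit

# Cross-check with DBSCAN on the same scores: anomalous rows should fall
# outside the dominant cluster (noise label -1 or a small separate cluster).
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(scores)
main_cluster = np.bincount(labels[labels >= 0]).argmax()
dbscan_flags = labels != main_cluster

print("T^2 flags:", t2_flags.sum(),
      "DBSCAN flags:", dbscan_flags.sum(),
      "agreement:", np.sum(t2_flags & dbscan_flags))
```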
Related papers
- RoCA: Robust Contrastive One-class Time Series Anomaly Detection with Contaminated Data [19.25420308920505]
Methods based on normality assumptions face three limitations; their basic assumption is that the training data are uncontaminated (free of anomalies).
This paper proposes a novel robust approach, RoCA, which is the first to address all three of these challenges.
arXiv Detail & Related papers (2025-03-24T06:52:28Z) - Robust Multilinear Principal Component Analysis [0.0]
Multilinear Principal Component Analysis (MPCA) is an important tool for analyzing tensor data.
Standard MPCA is sensitive to outliers.
This paper introduces a novel robust MPCA method that can handle both types of outliers simultaneously.
arXiv Detail & Related papers (2025-03-10T13:41:03Z) - Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls [65.44462297594308]
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data.
Most unsupervised outlier detection methods are carefully designed to detect specified outliers.
We propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers.
arXiv Detail & Related papers (2025-01-06T12:35:51Z) - Mitigating covariate shift in non-colocated data with learned parameter priors [0.0]
We present Fragmentation-induced co-shift remediation ($FIcsR$), which minimizes an $f$-divergence between a fragment's covariate distribution and that of the standard cross-validation baseline.
We run extensive classification experiments on multiple data classes, over $40$ datasets, and with data batched over multiple sequence lengths.
The results are promising under all these conditions, with accuracy improved over the batch and fold state-of-the-art by more than 5% and 10%, respectively.
arXiv Detail & Related papers (2024-11-10T15:48:29Z) - Imbalanced Aircraft Data Anomaly Detection [103.01418862972564]
Anomaly detection in temporal data from sensors under aviation scenarios is a practical but challenging task.
We propose a Graphical Temporal Data Analysis framework.
It consists of three modules: Series-to-Image (S2I), Cluster-based Resampling Approach using Euclidean Distance (CRD), and Variance-Based Loss (VBL).
arXiv Detail & Related papers (2023-05-17T09:37:07Z) - ODIM: Outlier Detection via Likelihood of Under-Fitted Generative Models [4.956259629094216]
The unsupervised outlier detection (UOD) problem refers to the task of identifying inliers given training data that contain both inliers and outliers.
We develop a new method, outlier detection via the IM effect (ODIM).
Remarkably, ODIM requires only a few updates, making it computationally efficient: at least tens of times faster than other deep-learning-based algorithms.
arXiv Detail & Related papers (2023-01-11T01:02:27Z) - Robust computation of optimal transport by $\beta$-potential regularization [79.24513412588745]
Optimal transport (OT) has become a widely used tool in the machine learning field to measure the discrepancy between probability distributions.
We propose regularizing OT with the $\beta$-potential term associated with the so-called $\beta$-divergence.
We experimentally demonstrate that the transport matrix computed with our algorithm helps estimate a probability distribution robustly even in the presence of outliers.
arXiv Detail & Related papers (2022-12-26T18:37:28Z) - Capturing the Denoising Effect of PCA via Compression Ratio [3.967854215226183]
Principal component analysis (PCA) is one of the most fundamental tools in machine learning.
In this paper, we propose a novel metric called the compression ratio to capture the effect of PCA on high-dimensional noisy data.
Building on this new metric, we design a straightforward algorithm that could be used to detect outliers.
arXiv Detail & Related papers (2022-04-22T18:43:47Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Efficient remedies for outlier detection with variational autoencoders [8.80692072928023]
Likelihoods computed by deep generative models are a candidate metric for outlier detection with unlabeled data.
We show that a theoretically grounded correction readily ameliorates a key bias in VAE likelihood estimates.
We also show that the variance of the likelihoods computed over an ensemble of VAEs also enables robust outlier detection.
arXiv Detail & Related papers (2021-08-19T16:00:58Z) - Noise-Resistant Deep Metric Learning with Probabilistic Instance Filtering [59.286567680389766]
Noisy labels are commonly found in real-world data, which cause performance degradation of deep neural networks.
We propose a Probabilistic Ranking-based Instance Selection with Memory (PRISM) approach for DML.
PRISM calculates the probability of a label being clean, and filters out potentially noisy samples.
arXiv Detail & Related papers (2021-08-03T12:15:25Z) - SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z) - Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - robROSE: A robust approach for dealing with imbalanced data in fraud detection [2.1734195143282697]
A major challenge when trying to detect fraud is that the fraudulent activities form a minority class which makes up a very small proportion of the data set.
We present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with the problem of imbalanced data.
arXiv Detail & Related papers (2020-03-22T16:11:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.