ECOD: Unsupervised Outlier Detection Using Empirical Cumulative
Distribution Functions
- URL: http://arxiv.org/abs/2201.00382v1
- Date: Sun, 2 Jan 2022 17:28:35 GMT
- Title: ECOD: Unsupervised Outlier Detection Using Empirical Cumulative
Distribution Functions
- Authors: Zheng Li, Yue Zhao, Xiyang Hu, Nicola Botta, Cezar Ionescu, George H.
Chen
- Abstract summary: Outlier detection refers to the identification of data points that deviate from a general data distribution.
We present ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution.
- Score: 12.798256312657136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Outlier detection refers to the identification of data points that deviate
from a general data distribution. Existing unsupervised approaches often suffer
from high computational cost, complex hyperparameter tuning, and limited
interpretability, especially when working with large, high-dimensional
datasets. To address these issues, we present a simple yet effective algorithm
called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which
is inspired by the fact that outliers are often the "rare events" that appear
in the tails of a distribution. In a nutshell, ECOD first estimates the
underlying distribution of the input data in a nonparametric fashion by
computing the empirical cumulative distribution per dimension of the data. ECOD
then uses these empirical distributions to estimate tail probabilities per
dimension for each data point. Finally, ECOD computes an outlier score of each
data point by aggregating estimated tail probabilities across dimensions. Our
contributions are as follows: (1) we propose a novel outlier detection method
called ECOD, which is both parameter-free and easy to interpret; (2) we perform
extensive experiments on 30 benchmark datasets, where we find that ECOD
outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and
scalability; and (3) we release an easy-to-use and scalable (with distributed
support) Python implementation for accessibility and reproducibility.
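The three steps of ECOD described above can be sketched in Python. This is an illustrative reimplementation of the idea, not the authors' released code: it scores each point by the smaller of its left and right empirical tail probabilities per dimension, whereas the actual ECOD aggregation also accounts for per-dimension skewness.

```python
import numpy as np

def ecod_scores(X):
    """Illustrative ECOD-style outlier scores (not the authors' code).

    Per dimension, estimate left/right tail probabilities from the
    empirical CDF, then aggregate negative log tail probabilities
    across dimensions into a single outlier score per point.
    """
    n, d = X.shape
    left = np.empty((n, d))
    right = np.empty((n, d))
    for j in range(d):
        col = np.sort(X[:, j])
        # Left tail: fraction of points <= x (empirical CDF).
        left[:, j] = np.searchsorted(col, X[:, j], side="right") / n
        # Right tail: fraction of points >= x (survival function).
        right[:, j] = (n - np.searchsorted(col, X[:, j], side="left")) / n
    # A point is extreme in a dimension if either tail probability is small.
    tail = np.minimum(left, right)
    # Aggregate across dimensions; larger score = more outlying.
    return -np.log(tail).sum(axis=1)
```

Because the method is parameter-free, usage reduces to a single call: points in the tails of many dimensions receive the largest scores.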
Related papers
- Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
- SCOD: From Heuristics to Theory [4.512926716151403]
This paper addresses the problem of designing reliable prediction models that abstain from predictions when faced with uncertain or out-of-distribution samples.
We make three key contributions to Selective Classification in the presence of Out-of-Distribution data (SCOD)
arXiv Detail & Related papers (2024-03-25T16:36:13Z)
- Distributed Semi-Supervised Sparse Statistical Inference [6.685997976921953]
A debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters.
Traditional methods require computing a debiased estimator on every machine.
An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabeled data, is developed.
arXiv Detail & Related papers (2023-06-17T17:30:43Z)
- A parametric distribution for exact post-selection inference with data carving [0.0]
Post-selection inference (PoSI) is a technique for obtaining valid confidence intervals and p-values when hypothesis generation and testing use the same source of data.
Data carving is a variant of PoSI in which a portion of held out data is combined with the hypothesis generating data at inference time.
arXiv Detail & Related papers (2023-05-21T22:29:55Z)
- Data thinning for convolution-closed distributions [2.299914829977005]
We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation.
We show that data thinning can be used to validate the results of unsupervised learning approaches.
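The splitting idea above can be sketched for one convolution-closed family, the Poisson; this is an illustrative example chosen here, not code from the paper.

```python
import numpy as np

def poisson_thin(x, eps, rng):
    """Illustrative data thinning for a Poisson observation.

    A Poisson(lam) draw x is split as x1 ~ Binomial(x, eps) and
    x2 = x - x1; x1 and x2 are independent Poisson(eps * lam) and
    Poisson((1 - eps) * lam) variables that sum to the original x.
    """
    x1 = rng.binomial(x, eps)
    return x1, x - x1
```

One part can then be used to fit an unsupervised model and the other, independent part to validate it.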
arXiv Detail & Related papers (2023-01-18T02:47:41Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
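The ATC idea can be sketched as follows; the function and variable names are illustrative, not taken from the paper's code. The threshold is set on labeled source data so that the fraction of source confidences above it matches source accuracy, and target accuracy is predicted as the fraction of unlabeled target confidences above the same threshold.

```python
import numpy as np

def atc_predict_accuracy(source_conf, source_correct, target_conf):
    """Illustrative Average-Thresholded-Confidence-style estimator.

    source_conf:    model confidences on labeled source data
    source_correct: 0/1 correctness indicators on source data
    target_conf:    model confidences on unlabeled target data
    """
    acc = source_correct.mean()
    # Pick t so that the fraction of source confidences above t equals
    # the source accuracy.
    t = np.quantile(source_conf, 1.0 - acc)
    # Predicted target accuracy: fraction of target confidences above t.
    return (target_conf > t).mean()
```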
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- COPOD: Copula-Based Outlier Detection [7.963284082401154]
Outlier detection refers to the identification of rare items that are deviant from the general data distribution.
Existing approaches suffer from high computational complexity, low predictive capability, and limited interpretability.
We present a novel outlier detection algorithm called COPOD.
arXiv Detail & Related papers (2020-09-20T16:06:39Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- Improving Generative Adversarial Networks with Local Coordinate Coding [150.24880482480455]
Generative adversarial networks (GANs) have shown remarkable success in generating realistic data from some predefined prior distribution.
In practice, semantic information might be represented by some latent distribution learned from data.
We propose an LCCGAN model with local coordinate coding (LCC) to improve the performance of generating data.
arXiv Detail & Related papers (2020-07-28T09:17:50Z)
- Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data [87.61504710345528]
We propose two strategies for freeing a neural network from tuning with OoD data, while improving its OoD detection performance.
We specifically propose to decompose confidence scoring as well as a modified input pre-processing method.
Our further analysis on a larger scale image dataset shows that the two types of distribution shifts, specifically semantic shift and non-semantic shift, present a significant difference.
arXiv Detail & Related papers (2020-02-26T04:18:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.