Comparison of Outlier Detection Techniques for Structured Data
- URL: http://arxiv.org/abs/2106.08779v1
- Date: Wed, 16 Jun 2021 13:40:02 GMT
- Title: Comparison of Outlier Detection Techniques for Structured Data
- Authors: Amulya Agarwal and Nitin Gupta
- Abstract summary: An outlier is an observation or a data point that is far from the rest of the data points in a given dataset.
It is seen that the removal of outliers from the training dataset before modeling can give better predictions.
The goal of this work is to highlight and compare some of the existing outlier detection techniques for the data scientists to use that information for outlier algorithm selection.
- Score: 2.2907341026741017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An outlier is an observation or data point that lies far from the rest of the data points in a given dataset; put another way, an outlier is far from the center of mass of the observations. The presence of outliers can skew statistical measures and data distributions, leading to a misleading representation of the underlying data and relationships. Removing outliers from the training dataset before modeling has been observed to give better predictions. With the advancement of machine learning, outlier detection models are also advancing at a good pace. The goal of this work is to highlight and compare some of the existing outlier detection techniques so that data scientists can use that information for outlier algorithm selection while building a machine learning model.
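The removal step the abstract describes can be sketched with a simple rule-based detector. The following is a minimal illustration, not the paper's method: it uses Tukey's IQR fences, one of the classical techniques such a survey typically covers, to drop points far from the mass of the data.

```python
import numpy as np

def remove_outliers_iqr(x, k=1.5):
    """Drop points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return x[(x >= lo) & (x <= hi)]

# 55.0 sits far from the center of mass of the other observations
data = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 55.0])
clean = remove_outliers_iqr(data)  # the 55.0 point is removed
```

A model trained on `clean` rather than `data` sees a far less skewed distribution, which is the motivation the abstract gives for removal before modeling.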
Related papers
- Unsupervised Event Outlier Detection in Continuous Time [4.375463200687156]
We develop, to the best of our knowledge, the first unsupervised outlier detection approach for abnormal events in continuous time.
We train a 'generator' that corrects outliers in the data with a 'discriminator' that learns to discriminate the corrected data from the real data.
The experimental results show that our method can detect event outliers more accurately than the state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-25T14:29:39Z) - Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Quantile-based Maximum Likelihood Training for Outlier Detection [5.902139925693801]
We introduce a quantile-based maximum likelihood objective for learning the inlier distribution to improve the outlier separation during inference.
Our approach fits a normalizing flow to pre-trained discriminative features and detects the outliers according to the evaluated log-likelihood.
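The detection rule in this entry, scoring points by evaluated log-likelihood under a density fitted to inlier features, can be illustrated without the paper's normalizing flow. The sketch below substitutes a multivariate Gaussian for the flow (an assumption for brevity) and flags points whose log-likelihood falls below a low quantile of the inlier scores.

```python
import numpy as np

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(500, 2))  # stand-in for pre-trained features
mean = inliers.mean(axis=0)
cov = np.cov(inliers, rowvar=False)

def log_likelihood(x, mean, cov):
    """Gaussian log-density (a stand-in for a normalizing flow's log p(x))."""
    d = x - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = (d @ inv * d).sum(axis=1)
    return -0.5 * (quad + logdet + x.shape[1] * np.log(2 * np.pi))

ll = log_likelihood(inliers, mean, cov)
threshold = np.quantile(ll, 0.05)  # flag the lowest-likelihood 5% as outliers

query = np.array([[8.0, 8.0]])     # far from the inlier distribution
is_outlier = log_likelihood(query, mean, cov)[0] < threshold
```

In the paper the density model is a flow over discriminative features and the training objective is quantile-based, but the inference-time decision has this thresholded log-likelihood shape.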
arXiv Detail & Related papers (2023-08-20T22:27:54Z) - Meta-Learning for Unsupervised Outlier Detection with Optimal Transport [4.035753155957698]
We propose a novel approach to automate outlier detection based on meta-learning from previous datasets with outliers.
We leverage optimal transport in particular, to find the dataset with the most similar underlying distribution, and then apply the outlier detection techniques that proved to work best for that data distribution.
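The "find the most similar dataset, then reuse its best detector" step can be sketched with a 1-D empirical Wasserstein distance (the optimal-transport cost for sorted equal-size samples). The dataset names and the detector lookup below are hypothetical; the point is only the distribution-matching mechanic.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between equal-size samples."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

rng = np.random.default_rng(1)
new_dataset = rng.normal(0.0, 1.0, 1000)

# Previously seen datasets with known best-performing outlier detectors
past_datasets = {
    "gaussian_like": rng.normal(0.0, 1.0, 1000),
    "heavy_tailed": rng.standard_t(2, 1000),
    "shifted": rng.normal(5.0, 1.0, 1000),
}

# Pick the past dataset whose underlying distribution is closest
best_match = min(past_datasets,
                 key=lambda name: wasserstein_1d(new_dataset, past_datasets[name]))
# ...then apply the detector that worked best on `best_match`
```

The paper's formulation handles multivariate data and a learned meta-model; this 1-D version only shows why optimal transport is a natural similarity measure between datasets.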
arXiv Detail & Related papers (2022-11-01T10:36:48Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds the threshold.
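The ATC rule is simple enough to sketch directly. The calibration heuristic below (choose the threshold so that the fraction of source examples above it matches source accuracy) is one reasonable reading of the abstract, not necessarily the paper's exact procedure; the synthetic confidences are illustrative only.

```python
import numpy as np

def learn_atc_threshold(source_conf, source_correct):
    """Pick t so the fraction of source confidences above t equals source accuracy."""
    acc = source_correct.mean()
    return np.quantile(source_conf, 1.0 - acc)

def atc_predict_accuracy(target_conf, t):
    """Predicted target accuracy = fraction of target examples with confidence > t."""
    return (target_conf > t).mean()

rng = np.random.default_rng(2)
source_conf = rng.uniform(0.5, 1.0, 10_000)
source_correct = rng.uniform(0.0, 1.0, 10_000) < source_conf  # roughly calibrated
t = learn_atc_threshold(source_conf, source_correct)

target_conf = rng.uniform(0.4, 1.0, 10_000)  # shifted, less confident target domain
estimated_acc = atc_predict_accuracy(target_conf, t)
```

Note that no target labels are used: the estimate comes entirely from unlabeled target confidences, which is the method's selling point under distribution shift.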
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Unsupervised Outlier Detection using Memory and Contrastive Learning [53.77693158251706]
We argue that outlier detection can be done in the feature space by measuring the feature distance between outliers and inliers.
We propose MCOD, a framework that combines a memory module and a contrastive learning module.
Our proposed MCOD achieves considerable performance and outperforms nine state-of-the-art methods.
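The feature-distance intuition behind this entry can be illustrated with a plain k-nearest-neighbor distance score, a common baseline rather than MCOD itself: points whose features sit far from their nearest neighbors score as outliers.

```python
import numpy as np

def knn_outlier_scores(features, k=5):
    """Score each point by its mean distance to its k nearest neighbors in feature space."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(3)
inlier_feats = rng.normal(0.0, 1.0, (100, 8))
outlier_feat = np.full((1, 8), 6.0)      # far from the inlier cluster
feats = np.vstack([inlier_feats, outlier_feat])

scores = knn_outlier_scores(feats)       # the last row gets the highest score
```

MCOD's contribution is learning a feature space (via memory and contrastive learning) in which such distances separate outliers well; the scoring step itself looks like the above.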
arXiv Detail & Related papers (2021-07-27T07:35:42Z) - Do We Really Need to Learn Representations from In-domain Data for Outlier Detection? [6.445605125467574]
Methods based on the two-stage framework achieve state-of-the-art performance on this task.
We explore the possibility of avoiding the high cost of training a distinct representation for each outlier detection task.
In experiments, we demonstrate competitive or better performance on a variety of outlier detection benchmarks compared with previous two-stage methods.
arXiv Detail & Related papers (2021-05-19T17:30:28Z) - Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
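The suggested clean split can be sketched as a filter that keeps train and test disjoint in both source site and time period. The dictionary keys and cutoff scheme below are hypothetical placeholders for whatever metadata a news dataset actually carries.

```python
def clean_split(examples, test_sites, cutoff_date):
    """Split so train and test share no source site and no overlapping dates."""
    train = [e for e in examples
             if e["site"] not in test_sites and e["date"] < cutoff_date]
    test = [e for e in examples
            if e["site"] in test_sites and e["date"] >= cutoff_date]
    return train, test

articles = [
    {"site": "newsA", "date": 1},
    {"site": "newsA", "date": 3},
    {"site": "newsB", "date": 1},
    {"site": "newsB", "date": 3},
]
train, test = clean_split(articles, test_sites={"newsB"}, cutoff_date=2)
```

Articles that fall on the wrong side of either boundary are simply dropped, which trades data for a split free of the site and date artifacts the paper measures.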
arXiv Detail & Related papers (2021-04-20T17:16:41Z) - Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.