Comparison of Outlier Detection Techniques for Structured Data
- URL: http://arxiv.org/abs/2106.08779v1
- Date: Wed, 16 Jun 2021 13:40:02 GMT
- Title: Comparison of Outlier Detection Techniques for Structured Data
- Authors: Amulya Agarwal and Nitin Gupta
- Abstract summary: An outlier is an observation or a data point that is far from the rest of the data points in a given dataset.
It is seen that the removal of outliers from the training dataset before modeling can give better predictions.
The goal of this work is to highlight and compare some of the existing outlier detection techniques for the data scientists to use that information for outlier algorithm selection.
- Score: 2.2907341026741017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An outlier is an observation or data point that lies far from the rest of the data points in a given dataset; put another way, an outlier is far from the center of mass of the observations. The presence of outliers can skew statistical measures and data distributions, leading to a misleading representation of the underlying data and relationships. Removing outliers from the training dataset before modeling has been observed to give better predictions. With the advancement of machine learning, outlier detection models are also advancing at a good pace. The goal of this work is to highlight and compare some of the existing outlier detection techniques so that data scientists can use that information for outlier algorithm selection while building a machine learning model.
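The removal step the abstract describes can be sketched with a simple rule-based detector. The following is a minimal illustration, not the paper's method: it uses Tukey's IQR fences, one of the classical techniques such a survey typically covers, to drop points far from the mass of the data.

```python
import numpy as np

def remove_outliers_iqr(x, k=1.5):
    """Drop points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return x[(x >= lo) & (x <= hi)]

# 55.0 sits far from the center of mass of the other observations
data = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 55.0])
clean = remove_outliers_iqr(data)  # the 55.0 point is removed
```

A model trained on `clean` rather than `data` sees a far less skewed distribution, which is the motivation the abstract gives for removal before modeling.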
Related papers
- Unsupervised Event Outlier Detection in Continuous Time [4.375463200687156]
We develop, to the best of our knowledge, the first unsupervised outlier detection approach for abnormal events in continuous time.
We train a 'generator' that corrects outliers in the data with a 'discriminator' that learns to discriminate the corrected data from the real data.
The experimental results show that our method can detect event outliers more accurately than the state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-25T14:29:39Z) - Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Quantile-based Maximum Likelihood Training for Outlier Detection [5.902139925693801]
We introduce a quantile-based maximum likelihood objective for learning the inlier distribution to improve the outlier separation during inference.
Our approach fits a normalizing flow to pre-trained discriminative features and detects the outliers according to the evaluated log-likelihood.
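The detection rule in this entry, scoring points by evaluated log-likelihood under a density fitted to inlier features, can be illustrated without the paper's normalizing flow. The sketch below substitutes a multivariate Gaussian for the flow (an assumption for brevity) and flags points whose log-likelihood falls below a low quantile of the inlier scores.

```python
import numpy as np

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(500, 2))  # stand-in for pre-trained features
mean = inliers.mean(axis=0)
cov = np.cov(inliers, rowvar=False)

def log_likelihood(x, mean, cov):
    """Gaussian log-density (a stand-in for a normalizing flow's log p(x))."""
    d = x - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = (d @ inv * d).sum(axis=1)
    return -0.5 * (quad + logdet + x.shape[1] * np.log(2 * np.pi))

ll = log_likelihood(inliers, mean, cov)
threshold = np.quantile(ll, 0.05)  # flag the lowest-likelihood 5% as outliers

query = np.array([[8.0, 8.0]])     # far from the inlier distribution
is_outlier = log_likelihood(query, mean, cov)[0] < threshold
```

In the paper the density model is a flow over discriminative features and the training objective is quantile-based, but the inference-time decision has this thresholded log-likelihood shape.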
arXiv Detail & Related papers (2023-08-20T22:27:54Z) - Meta-Learning for Unsupervised Outlier Detection with Optimal Transport [4.035753155957698]
We propose a novel approach to automate outlier detection based on meta-learning from previous datasets with outliers.
We leverage optimal transport in particular, to find the dataset with the most similar underlying distribution, and then apply the outlier detection techniques that proved to work best for that data distribution.
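The "find the most similar dataset, then reuse its best detector" step can be sketched with a 1-D empirical Wasserstein distance (the optimal-transport cost for sorted equal-size samples). The dataset names and the detector lookup below are hypothetical; the point is only the distribution-matching mechanic.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between equal-size samples."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

rng = np.random.default_rng(1)
new_dataset = rng.normal(0.0, 1.0, 1000)

# Previously seen datasets with known best-performing outlier detectors
past_datasets = {
    "gaussian_like": rng.normal(0.0, 1.0, 1000),
    "heavy_tailed": rng.standard_t(2, 1000),
    "shifted": rng.normal(5.0, 1.0, 1000),
}

# Pick the past dataset whose underlying distribution is closest
best_match = min(past_datasets,
                 key=lambda name: wasserstein_1d(new_dataset, past_datasets[name]))
# ...then apply the detector that worked best on `best_match`
```

The paper's formulation handles multivariate data and a learned meta-model; this 1-D version only shows why optimal transport is a natural similarity measure between datasets.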
arXiv Detail & Related papers (2022-11-01T10:36:48Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds the threshold.
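The ATC rule is simple enough to sketch directly. The calibration heuristic below (choose the threshold so that the fraction of source examples above it matches source accuracy) is one reasonable reading of the abstract, not necessarily the paper's exact procedure; the synthetic confidences are illustrative only.

```python
import numpy as np

def learn_atc_threshold(source_conf, source_correct):
    """Pick t so the fraction of source confidences above t equals source accuracy."""
    acc = source_correct.mean()
    return np.quantile(source_conf, 1.0 - acc)

def atc_predict_accuracy(target_conf, t):
    """Predicted target accuracy = fraction of target examples with confidence > t."""
    return (target_conf > t).mean()

rng = np.random.default_rng(2)
source_conf = rng.uniform(0.5, 1.0, 10_000)
source_correct = rng.uniform(0.0, 1.0, 10_000) < source_conf  # roughly calibrated
t = learn_atc_threshold(source_conf, source_correct)

target_conf = rng.uniform(0.4, 1.0, 10_000)  # shifted, less confident target domain
estimated_acc = atc_predict_accuracy(target_conf, t)
```

Note that no target labels are used: the estimate comes entirely from unlabeled target confidences, which is the method's selling point under distribution shift.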
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Unsupervised Outlier Detection using Memory and Contrastive Learning [53.77693158251706]
We argue that outlier detection can be done in the feature space by measuring the feature distance between outliers and inliers.
We propose MCOD, a framework that combines a memory module and a contrastive learning module.
Our proposed MCOD achieves considerable performance and outperforms nine state-of-the-art methods.
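The feature-distance intuition behind this entry can be illustrated with a plain k-nearest-neighbor distance score, a common baseline rather than MCOD itself: points whose features sit far from their nearest neighbors score as outliers.

```python
import numpy as np

def knn_outlier_scores(features, k=5):
    """Score each point by its mean distance to its k nearest neighbors in feature space."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(3)
inlier_feats = rng.normal(0.0, 1.0, (100, 8))
outlier_feat = np.full((1, 8), 6.0)      # far from the inlier cluster
feats = np.vstack([inlier_feats, outlier_feat])

scores = knn_outlier_scores(feats)       # the last row gets the highest score
```

MCOD's contribution is learning a feature space (via memory and contrastive learning) in which such distances separate outliers well; the scoring step itself looks like the above.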
arXiv Detail & Related papers (2021-07-27T07:35:42Z) - Do We Really Need to Learn Representations from In-domain Data for Outlier Detection? [6.445605125467574]
Methods based on the two-stage framework achieve state-of-the-art performance on this task.
We explore the possibility of avoiding the high cost of training a distinct representation for each outlier detection task.
In experiments, we demonstrate competitive or better performance on a variety of outlier detection benchmarks compared with previous two-stage methods.
arXiv Detail & Related papers (2021-05-19T17:30:28Z) - Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
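The suggested clean split can be sketched as a filter that keeps train and test disjoint in both source site and time period. The dictionary keys and cutoff scheme below are hypothetical placeholders for whatever metadata a news dataset actually carries.

```python
def clean_split(examples, test_sites, cutoff_date):
    """Split so train and test share no source site and no overlapping dates."""
    train = [e for e in examples
             if e["site"] not in test_sites and e["date"] < cutoff_date]
    test = [e for e in examples
            if e["site"] in test_sites and e["date"] >= cutoff_date]
    return train, test

articles = [
    {"site": "newsA", "date": 1},
    {"site": "newsA", "date": 3},
    {"site": "newsB", "date": 1},
    {"site": "newsB", "date": 3},
]
train, test = clean_split(articles, test_sites={"newsB"}, cutoff_date=2)
```

Articles that fall on the wrong side of either boundary are simply dropped, which trades data for a split free of the site and date artifacts the paper measures.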
arXiv Detail & Related papers (2021-04-20T17:16:41Z) - Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.