A Probabilistic Transformation of Distance-Based Outliers
- URL: http://arxiv.org/abs/2305.09446v2
- Date: Tue, 18 Jul 2023 20:01:42 GMT
- Title: A Probabilistic Transformation of Distance-Based Outliers
- Authors: David Muhr, Michael Affenzeller, Josef Küng
- Abstract summary: We describe a generic transformation of distance-based outlier scores into interpretable, probabilistic estimates.
The transformation is ranking-stable and increases the contrast between normal and outlier data points.
Our work generalizes to a wide range of distance-based outlier detection methods.
- Score: 2.1055643409860743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scores of distance-based outlier detection methods are difficult to
interpret, making it challenging to determine a cut-off threshold between
normal and outlier data points without additional context. We describe a
generic transformation of distance-based outlier scores into interpretable,
probabilistic estimates. The transformation is ranking-stable and increases the
contrast between normal and outlier data points. Determining distance
relationships between data points is necessary to identify the nearest-neighbor
relationships in the data, yet most of the computed distances are typically
discarded. We show that the distances to other data points can be used to model
distance probability distributions and, subsequently, use the distributions to
turn distance-based outlier scores into outlier probabilities. Our experiments
show that the probabilistic transformation does not impact detection
performance over numerous tabular and image benchmark datasets but results in
interpretable outlier scores with increased contrast between normal and outlier
samples. Our work generalizes to a wide range of distance-based outlier
detection methods, and because existing distance computations are used, it adds
no significant computational overhead.
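The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' transformation: the abstract does not say which distribution model is used, so this example assumes an empirical CDF over all pairwise distances. The function name `knn_outlier_probabilities` and the parameter `k` are hypothetical. The sketch computes a k-nearest-neighbor distance score per point, then reuses the already computed pairwise distances to map each score to a value in [0, 1]. Because a CDF is monotone, the mapping cannot reverse the order of any two scores.

```python
import numpy as np


def knn_outlier_probabilities(X, k=5):
    """Hypothetical helper: map k-NN distance scores through an empirical CDF.

    Assumption: the distance distribution is modeled by the empirical CDF of
    all pairwise distances. The paper may use a different model.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)

    # Full pairwise Euclidean distance matrix (n x n).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # Distance-based outlier score: distance to the k-th nearest neighbor.
    # After sorting each row, column 0 is the zero self-distance.
    scores = np.sort(D, axis=1)[:, k]

    # Reuse the distances that k-NN search would normally discard:
    # pool every unordered pair once (upper triangle, no diagonal).
    pool = np.sort(D[np.triu_indices(n, k=1)])

    # Empirical CDF of the pooled distances, evaluated at each score.
    probs = np.searchsorted(pool, scores, side="right") / pool.size
    return scores, probs


# Usage: 50 clustered points plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])
scores, probs = knn_outlier_probabilities(X, k=5)
print(int(np.argmax(probs)))  # index of the point with the highest value
```

In this sketch the outputs are bounded in [0, 1] regardless of the scale of the input data, which is what makes a fixed cut-off threshold easier to reason about than a raw distance.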
Related papers
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z) - Positive Difference Distribution for Image Outlier Detection using
Normalizing Flows and Contrastive Data [2.9005223064604078]
Likelihoods learned by a generative model, e.g., a normalizing flow via standard log-likelihood training, perform poorly as an outlier score.
We propose to use an unlabelled auxiliary dataset and a probabilistic outlier score for outlier detection.
We show that this is equivalent to learning the normalized positive difference between the in-distribution and the contrastive feature density.
arXiv Detail & Related papers (2022-08-30T07:00:46Z) - Robust Multi-Object Tracking by Marginal Inference [92.48078680697311]
Multi-object tracking in videos requires solving a fundamental problem of one-to-one assignment between objects in adjacent frames.
We present an efficient approach to compute a marginal probability for each pair of objects in real time.
It achieves competitive results on MOT17 and MOT20 benchmarks.
arXiv Detail & Related papers (2022-08-07T14:04:45Z) - Kernel distance measures for time series, random fields and other
structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data.
It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution.
Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
arXiv Detail & Related papers (2021-09-29T22:54:17Z) - The Exploitation of Distance Distributions for Clustering [3.42658286826597]
In cluster analysis, different properties for distance distributions are judged to be relevant for appropriate distance selection.
By systematically investigating this specification using distribution analysis through a mirrored-density plot, it is shown that multimodal distance distributions are preferable in cluster analysis.
Experiments are performed on several artificial datasets and natural datasets for the task of clustering.
arXiv Detail & Related papers (2021-08-22T06:22:08Z) - Comparison of Outlier Detection Techniques for Structured Data [2.2907341026741017]
An outlier is an observation or a data point that is far from the rest of the data points in a given dataset.
It is seen that the removal of outliers from the training dataset before modeling can give better predictions.
The goal of this work is to highlight and compare some existing outlier detection techniques so that data scientists can use that information when selecting an outlier detection algorithm.
arXiv Detail & Related papers (2021-06-16T13:40:02Z) - On the relation between statistical learning and perceptual distances [61.25815733012866]
We show that perceptual sensitivity is correlated with the probability of an image in its close neighborhood.
We also explore the relation between distances induced by autoencoders and the probability distribution of the data used for training them.
arXiv Detail & Related papers (2021-06-08T14:56:56Z) - Pretrained equivariant features improve unsupervised landmark discovery [69.02115180674885]
We formulate a two-step unsupervised approach that overcomes this challenge by first learning powerful pixel-based features.
Our method produces state-of-the-art results in several challenging landmark detection datasets.
arXiv Detail & Related papers (2021-04-07T05:42:11Z) - $\gamma$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a
Robust Divergence Estimator [95.71091446753414]
We propose to use a nearest-neighbor-based $\gamma$-divergence estimator as a data discrepancy measure.
Our method achieves significantly higher robustness than existing discrepancy measures.
arXiv Detail & Related papers (2020-06-13T06:09:27Z)