Unsupervised anomaly detection algorithms on real-world data: how many do we need?
- URL: http://arxiv.org/abs/2305.00735v1
- Date: Mon, 1 May 2023 09:27:42 GMT
- Title: Unsupervised anomaly detection algorithms on real-world data: how many do we need?
- Authors: Roel Bouman, Zaharah Bukhsh, Tom Heskes
- Abstract summary: This study is the largest comparison of unsupervised anomaly detection algorithms to date.
On the local datasets the $k$NN ($k$-nearest neighbor) algorithm comes out on top.
On the global datasets the EIF (extended isolation forest) algorithm performs the best.
- Score: 1.4610038284393165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study we evaluate 32 unsupervised anomaly detection algorithms on 52
real-world multivariate tabular datasets, performing the largest comparison of
unsupervised anomaly detection algorithms to date. On this collection of
datasets, the $k$-thNN (distance to the $k$th-nearest neighbor) algorithm
significantly outperforms most other algorithms. Visualizing and then
clustering the relative performance of the considered algorithms on all
datasets, we identify two clear clusters: one with ``local'' datasets, and
another with ``global'' datasets. ``Local'' anomalies occupy a region with low
density when compared to nearby samples, while ``global'' anomalies occupy an overall low
density region in the feature space. On the local datasets the $k$NN
($k$-nearest neighbor) algorithm comes out on top. On the global datasets, the
EIF (extended isolation forest) algorithm performs the best. Also taking into
consideration the algorithms' computational complexity, a toolbox with these
three unsupervised anomaly detection algorithms suffices for finding anomalies
in this representative collection of multivariate datasets. By providing access
to code and datasets, our study can be easily reproduced and extended with more
algorithms and/or datasets.
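For illustration, a minimal sketch of such a three-algorithm toolbox is given below. It is not the authors' code: it assumes scikit-learn, computes the $k$-thNN and $k$NN scores directly from nearest-neighbor distances, and uses the standard IsolationForest as a stand-in for EIF, since the extended splitting variant is not part of scikit-learn.

```python
# Illustrative sketch only (not the paper's code), assuming scikit-learn.
# kthNN = distance to the k-th nearest neighbor, kNN = mean distance to the
# k nearest neighbors; IsolationForest stands in for EIF. Higher score = more anomalous.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest

def knn_scores(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    dist = dist[:, 1:]                               # drop the zero self-distance
    return dist[:, -1], dist.mean(axis=1)            # (kthNN score, kNN score)

def iforest_scores(X, seed=0):
    clf = IsolationForest(random_state=seed).fit(X)
    return -clf.score_samples(X)                     # negate: higher = more anomalous

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 4))
X[:5] += 6.0                                         # plant a few "global" anomalies
kth_nn, k_nn = knn_scores(X, k=5)
eif_like = iforest_scores(X)
```

In a benchmark setting such as the paper's, these scores would then be compared against ground-truth anomaly labels, e.g. via ROC AUC.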
Related papers
- A Novel Adaptive Fine-Tuning Algorithm for Multimodal Models: Self-Optimizing Classification and Selection of High-Quality Datasets in Remote Sensing [46.603157010223505]
We propose an adaptive fine-tuning algorithm for multimodal large models.
We train the model on two 3090 GPUs using one-third of the GeoChat multimodal remote sensing dataset.
The model achieved scores of 89.86 and 77.19 on the UCMerced and AID evaluation datasets.
arXiv Detail & Related papers (2024-09-20T09:19:46Z) - ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate
Nearest Neighbor Search Algorithms [5.478671305092084]
We introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms.
We develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets.
arXiv Detail & Related papers (2023-05-07T19:28:23Z) - Quantum Algorithm for Unsupervised Anomaly Detection [5.4335077019052145]
Anomaly detection plays a critical role in fraud detection, health care intrusion detection, military surveillance, etc.
The Local Outlier Factor (LOF) algorithm has been extensively studied.
Here we present a quantum LOF algorithm consisting of three parts corresponding to the classical algorithm.
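For reference, a minimal sketch of the classical LOF baseline that the quantum version mirrors is shown below; it assumes scikit-learn and is purely illustrative, not part of the quantum paper.

```python
# Classical LOF baseline, illustrative only, assuming scikit-learn.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
X[:3] += 5.0                                   # plant a few outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_         # larger = more anomalous
```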
arXiv Detail & Related papers (2023-04-18T03:20:11Z) - Linearized Wasserstein dimensionality reduction with approximation
guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - SSDBCODI: Semi-Supervised Density-Based Clustering with Outliers
Detection Integrated [1.8444322599555096]
Clustering analysis is one of the critical tasks in machine learning.
Because the performance of clustering can be significantly eroded by outliers, many algorithms try to incorporate outlier detection.
We propose SSDBCODI, a semi-supervised density-based clustering algorithm with integrated outlier detection.
arXiv Detail & Related papers (2022-08-10T21:06:38Z) - Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data [15.030895782548576]
The imbalanced classification problem turns out to be one of the important and challenging problems in data mining and machine learning.
The Tomek-Link sampling algorithm can effectively reduce class overlap in the data, remove majority instances that are difficult to distinguish, and improve classification accuracy.
However, the Tomek-Links under-sampling algorithm only considers the boundary instances that are the nearest neighbors to each other globally and ignores the potential local overlapping instances.
This paper proposes a multi-granularity relabeled under-sampling algorithm (MGRU) which fully considers the local information of the data set.
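As a point of reference, a minimal sketch of the plain Tomek-Links under-sampling baseline that MGRU extends is given below; it assumes the imbalanced-learn package and does not implement MGRU itself.

```python
# Plain Tomek-Links under-sampling baseline (not MGRU), assuming imbalanced-learn.
import numpy as np
from imblearn.under_sampling import TomekLinks

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (900, 2)),   # majority class
               rng.normal(1.0, 1.0, (100, 2))])  # overlapping minority class
y = np.array([0] * 900 + [1] * 100)

tl = TomekLinks()                    # by default removes the majority member of each Tomek link
X_res, y_res = tl.fit_resample(X, y)
```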
arXiv Detail & Related papers (2022-01-11T14:07:55Z) - Clustered Hierarchical Anomaly and Outlier Detection Algorithms [0.0]
We present CLAM, a fast hierarchical clustering technique that learns a manifold in a Banach space defined by a distance metric.
On 24 publicly available datasets, we compare the performance of CHAODA to a variety of state-of-the-art unsupervised anomaly-detection algorithms.
arXiv Detail & Related papers (2021-02-09T15:27:52Z) - A Systematic Characterization of Sampling Algorithms for Open-ended
Language Generation [71.31905141672529]
We study the widely adopted ancestral sampling algorithms for auto-regressive language models.
We identify three key properties that are shared among them: entropy reduction, order preservation, and slope preservation.
We find that the set of sampling algorithms that satisfies these properties performs on par with the existing sampling algorithms.
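As a concrete example of the sampling algorithms being characterised, the sketch below implements nucleus (top-$p$) sampling for a single decoding step in plain NumPy; it is illustrative and not taken from the paper.

```python
# Nucleus (top-p) sampling for one decoding step, illustrative only (NumPy).
import numpy as np

def nucleus_sample(probs, p=0.9, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(probs)[::-1]               # token ids by descending probability
    sorted_p = probs[order]
    keep = np.cumsum(sorted_p) - sorted_p < p     # smallest head whose mass reaches p
    trunc = np.where(keep, sorted_p, 0.0)
    trunc /= trunc.sum()                          # renormalise the kept head
    return int(order[rng.choice(len(probs), p=trunc)])

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])     # next-token distribution from a model
print(nucleus_sample(probs, p=0.9))
```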
arXiv Detail & Related papers (2020-09-15T17:28:42Z) - FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity
to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
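As an illustration of the CTA pattern, the sketch below shows the aggregation step of plain FedAvg as a data-size-weighted average of client parameter vectors; it is a simplified NumPy example, not FedPD itself.

```python
# FedAvg-style aggregation step (illustrative NumPy sketch, not FedPD).
import numpy as np

def fedavg(client_weights, client_sizes):
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()                  # weight clients by local data size
    return sum(c * w for c, w in zip(coeffs, client_weights))

# Each client computes locally, then the server aggregates ("computation then aggregation").
clients = [np.random.RandomState(i).normal(size=10) for i in range(3)]
global_model = fedavg(clients, client_sizes=[100, 200, 50])
```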
arXiv Detail & Related papers (2020-05-22T23:07:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.