SECODA: Segmentation- and Combination-Based Detection of Anomalies
- URL: http://arxiv.org/abs/2008.06869v1
- Date: Sun, 16 Aug 2020 10:03:14 GMT
- Title: SECODA: Segmentation- and Combination-Based Detection of Anomalies
- Authors: Ralph Foorthuis
- Abstract summary: SECODA is an unsupervised non-parametric anomaly detection algorithm for datasets containing continuous and categorical attributes.
The algorithm has a low memory imprint and its runtime performance scales linearly with the size of the dataset.
An evaluation with simulated and real-life datasets shows that this algorithm is able to identify many different types of anomalies.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study introduces SECODA, a novel general-purpose unsupervised
non-parametric anomaly detection algorithm for datasets containing continuous
and categorical attributes. The method is guaranteed to identify cases with
unique or sparse combinations of attribute values. Continuous attributes are
discretized repeatedly in order to correctly determine the frequency of such
value combinations. The concept of constellations, exponentially increasing
weights and discretization cut points, as well as a pruning heuristic are used
to detect anomalies with an optimal number of iterations. Moreover, the
algorithm has a low memory imprint and its runtime performance scales linearly
with the size of the dataset. An evaluation with simulated and real-life
datasets shows that this algorithm is able to identify many different types of
anomalies, including complex multidimensional instances. An evaluation in terms
of a data quality use case with a real dataset demonstrates that SECODA can
bring relevant and practical value to real-world settings.
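The abstract's core idea (repeatedly discretizing continuous attributes, combining all attribute values into "constellations", and scoring each case by how rare its constellation is) can be illustrated with a minimal Python sketch. This is not the author's implementation. The function names, equal-width binning, the bin schedule (2, 4, 8, ...) and the weight schedule (1, 2, 4, ...) are assumptions chosen for illustration, and the pruning heuristic is omitted.

```python
from collections import Counter


def equal_width_bins(values, bins):
    """Map each value to an equal-width interval index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # Clamp so that the maximum value falls into the last interval.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def constellation_scores(continuous, categorical, iterations=3):
    """Weighted average constellation frequency per case (lower = rarer).

    continuous:  list of columns, each a list of numbers
    categorical: list of columns, each a list of hashable labels
    """
    n = len((continuous + categorical)[0])
    totals = [0.0] * n
    weight_sum = 0
    for i in range(iterations):
        bins = 2 ** (i + 1)   # assumed schedule: finer cut points each round
        weight = 2 ** i       # assumed schedule: finer rounds count more
        binned = [equal_width_bins(col, bins) for col in continuous]
        # A constellation is the tuple of all (binned) attribute values of a case.
        constellations = list(zip(*binned, *categorical))
        counts = Counter(constellations)
        for case, key in enumerate(constellations):
            totals[case] += weight * counts[key]
        weight_sum += weight
    return [t / weight_sum for t in totals]


# Hypothetical toy data: six similar cases and one extreme value.
x = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 10.0]
label = ["a"] * 7
scores = constellation_scores([x], [label])
print(scores.index(min(scores)))  # index of the rarest case
```

In this toy dataset the last case sits alone in its interval in every iteration, so it receives the lowest score. A unique combination of categorical values is flagged the same way, because its constellation has a frequency of one regardless of the binning.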
Related papers
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from $O(n^2)$ to $O(kn)$ with $k \ll n$.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Autoencoder Based Iterative Modeling and Multivariate Time-Series Subsequence Clustering Algorithm [0.0]
This paper introduces an algorithm for the detection of change-points and the identification of the corresponding subsequences in transient multivariate time-series data (MTSD).
We use a recurrent neural network (RNN) based Autoencoder (AE) which is iteratively trained on incoming data.
A model of the identified subsequence is saved and used for recognition of repeating subsequences as well as fast offline clustering.
arXiv Detail & Related papers (2022-09-09T09:59:56Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Model-based clustering of partial records [11.193504036335503]
We develop clustering methodology through a model-based approach using the marginal density for the observed values.
We compare our algorithm to the corresponding full expectation-maximization (EM) approach that considers the missing values in the incomplete data set.
Simulation studies demonstrate that our approach has favorable recovery of the true cluster partition compared to case deletion and imputation.
arXiv Detail & Related papers (2021-03-30T13:30:59Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Learning from Incomplete Features by Simultaneous Training of Neural Networks and Sparse Coding [24.3769047873156]
This paper addresses the problem of training a classifier on a dataset with incomplete features.
We assume that different subsets of features (random or structured) are available at each data instance.
A new supervised learning method is developed to train a general classifier, using only a subset of features per sample.
arXiv Detail & Related papers (2020-11-28T02:20:39Z)
- The Impact of Discretization Method on the Detection of Six Types of Anomalies in Datasets [0.0]
Anomaly detection is the process of identifying cases, or groups of cases, that are in some way unusual and do not fit the general patterns present in the dataset.
Numerous algorithms use discretization of numerical data in their detection processes.
This study investigates the effect of the discretization method on the unsupervised detection of each of the six anomaly types acknowledged in a recent typology of data anomalies.
arXiv Detail & Related papers (2020-08-27T18:43:55Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.