Categorical anomaly detection in heterogeneous data using minimum
description length clustering
- URL: http://arxiv.org/abs/2006.07916v1
- Date: Sun, 14 Jun 2020 14:48:37 GMT
- Authors: James Cheney, Xavier Gombau, Ghita Berrada and Sidahmed
Benabderrahmane
- Abstract summary: We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data.
Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms.
- Score: 3.871148938060281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fast and effective unsupervised anomaly detection algorithms have been
proposed for categorical data based on the minimum description length (MDL)
principle. However, they can be ineffective when detecting anomalies in
heterogeneous datasets representing a mixture of different sources, such as
security scenarios in which system and user processes have distinct behavior
patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly
detection model to deal with heterogeneous data by fitting a mixture model to
the data, via a variant of k-means clustering. Our experimental results show
that using a discrete mixture model provides competitive performance relative
to two previous anomaly detection algorithms, while mixtures of more
sophisticated models yield further gains, on both synthetic datasets and
realistic datasets from a security scenario.
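The idea of fitting a discrete mixture with a k-means-style loop can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes each cluster is modelled by independent, Laplace-smoothed categorical distributions per attribute, that a record's code length is its negative log2 probability under a cluster, and that the anomaly score is the code length under the assigned cluster. The names `fit_cluster`, `code_length`, and `mdl_kmeans` and the parameters `alpha`, `iters`, and `seed` are hypothetical.

```python
import math
import random
from collections import Counter


def fit_cluster(records, n_attrs):
    """Count attribute values for one cluster (assumed independent attributes)."""
    counts = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return counts, len(records)


def code_length(record, model, alphabet_sizes, alpha=1.0):
    """Bits needed to encode a record under a cluster model (smoothed frequencies)."""
    counts, n = model
    bits = 0.0
    for j, value in enumerate(record):
        # Laplace smoothing keeps unseen values at a finite code length.
        p = (counts[j][value] + alpha) / (n + alpha * alphabet_sizes[j])
        bits -= math.log2(p)
    return bits


def mdl_kmeans(records, k, iters=20, seed=0):
    """Hard-assignment clustering that moves each record to its cheapest cluster."""
    rng = random.Random(seed)
    n_attrs = len(records[0])
    alphabet_sizes = [len({r[j] for r in records}) for j in range(n_attrs)]
    assign = [rng.randrange(k) for _ in records]  # random initial clusters

    def fit_all(assignment):
        return [
            fit_cluster([r for r, a in zip(records, assignment) if a == c], n_attrs)
            for c in range(k)
        ]

    for _ in range(iters):
        models = fit_all(assign)
        # Assignment step: pick the cluster with the shortest code length.
        new_assign = [
            min(range(k), key=lambda c: code_length(r, models[c], alphabet_sizes))
            for r in records
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign

    models = fit_all(assign)
    # Anomaly score: code length under the record's own cluster (higher = more unusual).
    scores = [code_length(r, models[a], alphabet_sizes) for r, a in zip(records, assign)]
    return assign, scores
```

Calling `mdl_kmeans(records, k)` on a list of equal-length tuples of categorical values returns one cluster index and one score per record. The sketch scores only the data cost given the cluster and leaves out the model-cost term that a two-part MDL code would normally include, so very small clusters can make rare records look cheaper than they should.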
Related papers
- Research on Dynamic Data Flow Anomaly Detection based on Machine Learning [11.526496773281938]
In this study, an unsupervised learning method is employed to identify anomalies in dynamic data flows.
By clustering similar data, the model is able to detect data behaviour that deviates significantly from normal traffic without the need for labelled data.
Notably, it demonstrates robust and adaptable performance, particularly in the context of unbalanced data.
arXiv Detail & Related papers (2024-09-23T08:19:15Z)
- Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
- Weakly-supervised anomaly detection for multimodal data distributions [25.60381244912307]
We propose the Weakly-supervised Variational-mixture-model-based Anomaly Detector (WVAD).
WVAD excels in multimodal datasets.
Experimental results on three real-world datasets demonstrate WVAD's superiority.
arXiv Detail & Related papers (2024-06-13T14:14:27Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- A Robust and Flexible EM Algorithm for Mixtures of Elliptical
Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
The results show that the proposed strategies outperform classification based on observed data and maintain high accuracy even as the missing-data ratio increases.
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- Explainable Deep Few-shot Anomaly Detection with Deviation Networks [123.46611927225963]
We introduce a novel weakly-supervised anomaly detection framework to train detection models.
The proposed approach learns discriminative normality by leveraging the labeled anomalies and a prior probability.
Our model is substantially more sample-efficient and robust, and performs significantly better than state-of-the-art competing methods in both closed-set and open-set settings.
arXiv Detail & Related papers (2021-08-01T14:33:17Z)
- Model-based clustering of partial records [11.193504036335503]
We develop clustering methodology through a model-based approach using the marginal density for the observed values.
We compare our algorithm to the corresponding full expectation-maximization (EM) approach that considers the missing values in the incomplete data set.
Simulation studies demonstrate that our approach has favorable recovery of the true cluster partition compared to case deletion and imputation.
arXiv Detail & Related papers (2021-03-30T13:30:59Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Modeling Heterogeneous Statistical Patterns in High-dimensional Data by
Adversarial Distributions: An Unsupervised Generative Framework [33.652544673163774]
We propose a novel unsupervised generative framework called FIRD, which utilizes adversarial distributions to fit and disentangle the heterogeneous statistical patterns.
When applied to discrete spaces, FIRD effectively distinguishes the synchronized fraudsters from normal users.
arXiv Detail & Related papers (2020-12-15T08:51:20Z)
- Factor Analysis of Mixed Data for Anomaly Detection [5.77019633619109]
Anomalous observations may correspond to financial fraud, health risks, or incorrectly measured data in practice.
We show that detecting anomalies in high-dimensional mixed data is enhanced by first embedding the data and then assessing an anomaly scoring scheme.
arXiv Detail & Related papers (2020-05-25T14:13:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.