Random Similarity Isolation Forests
- URL: http://arxiv.org/abs/2502.19122v1
- Date: Wed, 26 Feb 2025 13:22:19 GMT
- Title: Random Similarity Isolation Forests
- Authors: Sebastian Chwilczyński, Dariusz Brzezinski
- Abstract summary: We propose a multi-modal outlier detection algorithm called Random Similarity Isolation Forest. Our method combines the notions of isolation and similarity-based projection to handle datasets with mixtures of features of arbitrary data types.
- Score: 1.2845170214324664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With predictive models becoming prevalent, companies are expanding the types of data they gather. As a result, the collected datasets consist not only of simple numerical features but also more complex objects such as time series, images, or graphs. Such multi-modal data have the potential to improve performance in predictive tasks like outlier detection, where the goal is to identify objects deviating from the main data distribution. However, current outlier detection algorithms are dedicated to individual types of data. Consequently, working with mixed types of data requires either fusing multiple data-specific models or transforming all of the representations into a single format, both of which can hinder predictive performance. In this paper, we propose a multi-modal outlier detection algorithm called Random Similarity Isolation Forest. Our method combines the notions of isolation and similarity-based projection to handle datasets with mixtures of features of arbitrary data types. Experiments performed on 47 benchmark datasets demonstrate that Random Similarity Isolation Forest outperforms five state-of-the-art competitors. Our study shows that the use of multiple modalities can indeed improve the detection of anomalies and highlights the need for new outlier detection benchmarks tailored for multi-modal algorithms.
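To make the combination of isolation and similarity-based projection more concrete, below is a minimal Python sketch of the idea described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: objects are dictionaries of named features, and the `distances` mapping from feature names to domain-specific distance functions (as well as the example feature names) are placeholders.

```python
import random

def build_tree(objects, distances, depth=0, max_depth=8):
    """objects: list of dicts mapping feature names to (possibly complex) values."""
    if len(objects) <= 1 or depth >= max_depth:
        return {"leaf": True, "size": len(objects), "depth": depth}
    feature = random.choice(list(distances))            # pick a random feature/modality
    dist = distances[feature]                           # its domain-specific distance
    p, q = random.sample(objects, 2)                    # two reference objects
    proj = [dist(x[feature], q[feature]) - dist(x[feature], p[feature])
            for x in objects]                           # similarity-based 1-D projection
    lo, hi = min(proj), max(proj)
    if lo == hi:                                        # objects are indistinguishable on this axis
        return {"leaf": True, "size": len(objects), "depth": depth}
    threshold = random.uniform(lo, hi)                  # random split, as in an isolation tree
    left = [x for x, v in zip(objects, proj) if v < threshold]
    right = [x for x, v in zip(objects, proj) if v >= threshold]
    return {"leaf": False, "feature": feature, "refs": (p, q), "threshold": threshold,
            "left": build_tree(left, distances, depth + 1, max_depth),
            "right": build_tree(right, distances, depth + 1, max_depth)}

# Toy example with two modalities: a numeric feature and a short time series,
# each paired with its own (placeholder) distance function.
data = [{"amount": float(i), "trace": [i, i + 1, i + 2]} for i in range(20)]
distances = {
    "amount": lambda a, b: abs(a - b),
    "trace": lambda a, b: sum(abs(u - v) for u, v in zip(a, b)),
}
tree = build_tree(data, distances)
```

As in classic Isolation Forests, an anomaly score would then be derived from the average depth at which an object is isolated across many such randomly built trees, with shallower isolation indicating a more likely outlier.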
Related papers
- A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series [0.01874930567916036]
Current publicly available datasets are too small, not diverse and feature trivial anomalies.
We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools.
We make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions.
As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts.
arXiv Detail & Related papers (2024-11-21T09:03:12Z)
- Regularized Contrastive Partial Multi-view Outlier Detection [76.77036536484114]
We propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD).
In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency.
Experimental results on four benchmark datasets demonstrate that our proposed approach could outperform state-of-the-art competitors.
arXiv Detail & Related papers (2024-08-02T14:34:27Z)
- Towards a General Time Series Anomaly Detector with Adaptive Bottlenecks and Dual Adversarial Decoders [16.31103717602631]
Time series anomaly detection plays a vital role in a wide range of applications.
Existing methods require training one specific model for each dataset.
We propose a general time series anomaly detection model, which is pre-trained on extensive multi-domain datasets.
arXiv Detail & Related papers (2024-05-24T06:59:43Z)
- Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series [45.76310830281876]
We propose Quantile Sub-Ensembles, a novel method to estimate uncertainty with an ensemble of quantile-regression-based task networks.
Our method not only produces accurate imputations that are robust to high missing rates, but is also computationally efficient due to the fast training of its non-generative model.
arXiv Detail & Related papers (2023-12-03T05:52:30Z)
- Hybrid Open-set Segmentation with Synthetic Negative Data [0.0]
Open-set segmentation can be conceived by complementing closed-set classification with anomaly detection.
We propose a novel anomaly score that fuses generative and discriminative cues.
Experiments reveal strong open-set performance in spite of negligible computational overhead.
arXiv Detail & Related papers (2023-01-19T11:02:44Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Random Similarity Forests [2.3204178451683264]
We propose a classification method capable of handling datasets with features of arbitrary data types while retaining each feature's characteristic.
The proposed algorithm, called Random Similarity Forest, uses multiple domain-specific distance measures to combine the predictive performance of Random Forests with the flexibility of Similarity Forests.
We show that Random Similarity Forests are on par with Random Forests on numerical data and outperform them on datasets from complex or mixed data domains.
arXiv Detail & Related papers (2022-04-11T20:14:05Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show that combining recent results on equivariant representation learning, instantiated on structured spaces, with a simple application of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining [0.6100370338020054]
We consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
arXiv Detail & Related papers (2020-06-16T19:20:50Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)