Revisiting randomized choices in isolation forests
- URL: http://arxiv.org/abs/2110.13402v1
- Date: Tue, 26 Oct 2021 04:08:49 GMT
- Title: Revisiting randomized choices in isolation forests
- Authors: David Cortes
- Abstract summary: Isolation forest or "iForest" is an intuitive and widely used algorithm for anomaly detection.
This paper shows that "clustered" diverse outliers can be more easily identified by applying a non-uniformly-random choice of variables and/or thresholds.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Isolation forest or "iForest" is an intuitive and widely used algorithm for
anomaly detection that follows a simple yet effective idea: in a given data
distribution, if a threshold (split point) is selected uniformly at random
within the range of some variable and data points are divided according to
whether they are greater or smaller than this threshold, outlier points are
more likely to end up alone or in the smaller partition. The original procedure
suggested the choice of variable to split and split point within a variable to
be done uniformly at random at each step, but this paper shows that "clustered"
diverse outliers - oftentimes a more interesting class of outliers than others
- can be more easily identified by applying a non-uniformly-random choice of
variables and/or thresholds. Different split guiding criteria are compared and
some are found to result in significantly better outlier discrimination for
certain classes of outliers.
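The uniform-random splitting idea described in the abstract can be sketched in a few lines. The following is a minimal single-tree illustration, not the paper's implementation; `isolation_path_length`, `avg_depth`, and the toy data are all illustrative names and values:

```python
import random

def isolation_path_length(point, data, max_depth=10, depth=0):
    """Depth at which `point` is separated from `data` by uniform-random,
    axis-aligned cuts (capped at max_depth), as in the original iForest idea."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:  # no spread left along this axis: cannot split further
        return depth
    threshold = random.uniform(lo, hi)
    # Keep only the partition that `point` falls into and recurse.
    same_side = [row for row in data
                 if (row[dim] < threshold) == (point[dim] < threshold)]
    return isolation_path_length(point, same_side, max_depth, depth + 1)

def avg_depth(point, data, trees=100):
    """Average isolation depth over many independent random trees."""
    return sum(isolation_path_length(point, data) for _ in range(trees)) / trees

random.seed(0)
cloud = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
inlier = cloud[0]
# A far-away point is isolated in fewer cuts, on average, than a central one;
# the paper's contribution is to make the choice of `dim` and `threshold`
# non-uniform so that clustered outliers also isolate quickly.
```

Averaging over many such trees turns the raw cut count into a usable outlier ranking: points with short average paths are flagged as anomalies.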
Related papers
- Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls [65.44462297594308]
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data.
Most unsupervised outlier detection methods are carefully designed to detect specified outliers.
We propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers.
arXiv Detail & Related papers (2025-01-06T12:35:51Z)
- SoftCVI: Contrastive variational inference with self-generated soft labels [2.5398014196797614]
Variational inference and Markov chain Monte Carlo methods are the predominant tools for this task.
We introduce Soft Contrastive Variational Inference (SoftCVI), which allows a family of variational objectives to be derived through a contrastive estimation framework.
We find that SoftCVI can be used to form objectives which are stable to train and mass-covering, frequently outperforming inference with other variational approaches.
arXiv Detail & Related papers (2024-07-22T14:54:12Z)
- Gower's similarity coefficients with automatic weight selection [0.0]
The most popular dissimilarity for mixed-type variables is derived as the complement of Gower's similarity coefficient.
The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted "standard" setting hides an unbalanced contribution of the single variables to the overall dissimilarity.
We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity.
arXiv Detail & Related papers (2024-01-30T14:21:56Z)
- Robust Outlier Rejection for 3D Registration with Variational Bayes [70.98659381852787]
We develop a novel variational non-local network-based outlier rejection framework for robust alignment.
We propose a voting-based inlier searching strategy to cluster the high-quality hypothetical inliers for transformation estimation.
arXiv Detail & Related papers (2023-04-04T03:48:56Z)
- Deep learning model solves change point detection for multiple change types [69.77452691994712]
Change-point detection aims to catch abrupt disorder in a data distribution.
We propose an approach that works in the multiple-distributions scenario.
arXiv Detail & Related papers (2022-04-15T09:44:21Z)
- Flexible variable selection in the presence of missing data [0.0]
We propose a non-parametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data.
We show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance.
arXiv Detail & Related papers (2022-02-25T21:41:03Z)
- Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones? [62.997667081978825]
Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods.
We compare these methods in extensive simulations to help in answering the primary question when to use multivariate ensemble techniques.
arXiv Detail & Related papers (2022-01-14T08:44:25Z)
- Isolation forests: looking beyond tree depth [0.0]
Fewer random cuts are needed to leave an outlier alone in a given subspace than to isolate a regular observation.
The original idea proposed an outlier score based on the tree depth (number of random cuts) required for isolation.
Experiments here show that using information about the size of the feature space taken and the number of points assigned to it can improve outlier discrimination in many situations.
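The depth-based baseline that this entry looks beyond can be sketched as follows. The normalization formula is the one from the original iForest paper (Liu et al., 2008): the expected path length E[h(x)] is divided by c(n), the average path length of an unsuccessful binary-search-tree lookup over n points. Function names here are illustrative:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """c(n) = 2*H(n-1) - 2*(n-1)/n, the expected path length of an
    unsuccessful BST search over n points, with the harmonic number
    H(i) approximated by ln(i) + Euler-Mascheroni."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n)): close to 1 for outliers,
    about 0.5 for points whose path length matches the expectation."""
    return 2.0 ** (-expected_path_length / c(n))
```

By construction, a point whose average path length equals c(n) scores exactly 0.5, while a point isolated after a single cut scores close to 1.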
arXiv Detail & Related papers (2021-11-23T04:04:31Z)
- Population based change-point detection for the identification of homozygosity islands [0.0]
We introduce a penalized maximum likelihood approach that can be efficiently computed by a dynamic programming algorithm or approximated by a fast greedy binary splitting algorithm.
We prove both algorithms converge almost surely to the set of change-points under very general assumptions on the distribution and independent sampling of the random vector.
This new approach is motivated by the problem of identifying homozygosity islands on the genome of individuals in a population.
arXiv Detail & Related papers (2021-11-19T12:53:41Z)
- Consensus-Guided Correspondence Denoising [67.35345850146393]
We propose to denoise correspondences with a local-to-global consensus learning framework that robustly identifies correspondences.
A novel "pruning" block is introduced to distill reliable candidates from initial matches according to their consensus scores estimated by dynamic graphs from local to global regions.
Our method outperforms the state of the art on robust line fitting, wide-baseline image matching and image localization benchmarks by noticeable margins.
arXiv Detail & Related papers (2021-01-03T09:10:00Z)
- Minimax Active Learning [61.729667575374606]
Active learning aims to develop label-efficient algorithms by querying the most representative samples to be labeled by a human annotator.
Current active learning techniques either rely on model uncertainty to select the most uncertain samples or use clustering or reconstruction to choose the most diverse set of unlabeled examples.
We develop a semi-supervised minimax entropy-based active learning algorithm that leverages both uncertainty and diversity in an adversarial manner.
arXiv Detail & Related papers (2020-12-18T19:03:40Z)
- Rethinking preventing class-collapsing in metric learning with margin-based losses [81.22825616879936]
Metric learning seeks embeddings where visually similar instances are close and dissimilar instances are apart.
Margin-based losses tend to project all samples of a class onto a single point in the embedding space.
We propose a simple modification to the embedding losses such that each sample selects its nearest same-class counterpart in a batch.
arXiv Detail & Related papers (2020-06-09T09:59:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.