Distribution and volume based scoring for Isolation Forests
- URL: http://arxiv.org/abs/2309.11450v1
- Date: Wed, 20 Sep 2023 16:27:10 GMT
- Title: Distribution and volume based scoring for Isolation Forests
- Authors: Hichem Dhouib, Alissa Wilms, Paul Boes
- Abstract summary: We make two contributions to the Isolation Forest method for anomaly and outlier detection.
The first is an information-theoretically motivated generalisation of the score function that is used to aggregate the scores across random tree estimators.
The second is an alternative scoring function at the level of the individual tree estimator, in which we replace the depth-based scoring of the Isolation Forest with one based on hyper-volumes associated to an isolation tree's leaf nodes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We make two contributions to the Isolation Forest method for anomaly and
outlier detection. The first contribution is an information-theoretically
motivated generalisation of the score function that is used to aggregate the
scores across random tree estimators. This generalisation allows one to take
into account not just the ensemble average across trees but instead the whole
distribution. The second contribution is an alternative scoring function at the
level of the individual tree estimator, in which we replace the depth-based
scoring of the Isolation Forest with one based on hyper-volumes associated to
an isolation tree's leaf nodes.
We motivate the use of both of these methods on generated data and also
evaluate them on 34 datasets from the recent and exhaustive "ADBench"
benchmark, finding significant improvement over the standard isolation forest
for both variants on some datasets and improvement on average across all
datasets for one of the two variants. The code to reproduce our results is made
available as part of the submission.
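The released code is not reproduced here; as a rough illustration of both ideas, the sketch below implements one plausible reading on top of scikit-learn's IsolationForest. The per-tree score of a point is taken to be the log hyper-volume of the leaf cell it falls into (cells clipped to the data's bounding box), and per-tree scores are aggregated with a log-domain power-mean family that reduces to the plain ensemble average at q = 0. The function names, the sign convention, the bounding-box choice, and the aggregator family are all illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.ensemble import IsolationForest

def leaf_log_volumes(est, lo, hi):
    """Log hyper-volume of each leaf's axis-aligned cell, found by
    walking the fitted tree's splits inside the data bounding box."""
    t = est.tree_
    log_vol = np.full(t.node_count, -np.inf)
    stack = [(0, lo.copy(), hi.copy())]
    while stack:
        node, lb, ub = stack.pop()
        if t.children_left[node] == -1:  # leaf node
            log_vol[node] = np.log(np.maximum(ub - lb, 1e-12)).sum()
            continue
        f, thr = t.feature[node], t.threshold[node]
        ub_left = ub.copy(); ub_left[f] = min(ub[f], thr)     # x[f] <= thr
        lb_right = lb.copy(); lb_right[f] = max(lb[f], thr)   # x[f] > thr
        stack.append((t.children_left[node], lb, ub_left))
        stack.append((t.children_right[node], lb_right, ub))
    return log_vol

def per_tree_volume_scores(forest, X):
    """Per-tree score: log volume of the leaf cell each point lands in
    (assumes the default max_features=1.0, so tree feature indices
    match X's columns)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return np.stack([leaf_log_volumes(est, lo, hi)[est.apply(X)]
                     for est in forest.estimators_])

def aggregate(per_tree, q=0.0):
    """Log-domain power-mean family over trees: q = 0 recovers the
    plain ensemble average, larger q up-weights the trees that find
    a point most anomalous."""
    if q == 0.0:
        return per_tree.mean(axis=0)
    return (logsumexp(q * per_tree, axis=0) - np.log(len(per_tree))) / q

# Toy usage: a Gaussian cloud plus a few uniform outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.uniform(-6.0, 6.0, size=(10, 2))])
forest = IsolationForest(random_state=0).fit(X)
scores = aggregate(per_tree_volume_scores(forest, X), q=2.0)
print(scores[-10:].mean() > scores[:500].mean())  # expect outliers to score higher
```

Larger leaf cells tend to cover sparser regions, which is why the sketch treats a high log-volume as more anomalous; sweeping q probes the whole distribution of per-tree scores rather than just its mean.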
Related papers
- SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation
Disaggregated evaluation, the estimation of a machine learning model's performance on different subpopulations, is a core task when assessing the performance and group fairness of AI systems.
We develop SureMap that has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models.
Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior with cross-validation-free tuning via Stein's unbiased risk estimate (SURE).
arXiv Detail & Related papers (2024-11-14T17:53:35Z)
- Heterogeneous Random Forest
Heterogeneous Random Forest (HRF) is designed to enhance tree diversity in a meaningful way.
HRF consistently outperformed other ensemble methods in terms of accuracy across the majority of datasets.
arXiv Detail & Related papers (2024-10-24T09:18:55Z)
- Learning a Decision Tree Algorithm with Transformers
We introduce MetaTree, a transformer-based model trained via meta-learning to directly produce strong decision trees.
We fit both greedy decision trees and globally optimized decision trees on a large number of datasets, and train MetaTree to produce only the trees that achieve strong generalization performance.
arXiv Detail & Related papers (2024-02-06T07:40:53Z)
- Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers
We argue that the current high-level dichotomy into bias- and variance-reduction prevalent in statistics is insufficient to understand tree ensembles.
We show that forests can improve upon trees by three distinct mechanisms that are usually implicitly entangled.
arXiv Detail & Related papers (2024-02-02T15:36:43Z)
- Individualized and Global Feature Attributions for Gradient Boosted Trees in the Presence of $\ell_2$ Regularization
We propose Prediction Decomposition (PreDecomp), a novel individualized feature attribution for boosted trees when they are trained with $\ell_2$ regularization.
We also propose TreeInner, a family of debiased global feature attributions defined in terms of the inner product between any individualized feature attribution and labels on out-sample data for each tree.
arXiv Detail & Related papers (2022-11-08T17:56:22Z)
- Data-driven advice for interpreting local and global model predictions in bioinformatics problems
Conditional feature contributions (CFCs) provide local, case-by-case explanations of a prediction.
We compare the explanations computed by both methods on a set of 164 publicly available classification problems.
For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores.
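For a concrete sense of that comparison, here is a minimal sketch (not the paper's code) that computes both kinds of local attributions for a random forest and correlates them per feature; it assumes the third-party `shap` and `treeinterpreter` packages.

```python
import numpy as np
import shap                                          # SHAP values for tree models
from treeinterpreter import treeinterpreter as ti   # conditional feature contributions
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Local SHAP attributions, one value per (sample, feature).
shap_values = shap.TreeExplainer(rf).shap_values(X)

# Local conditional feature contributions from the same forest.
_, _, cfc = ti.predict(rf, X)

# Per-feature correlation between the two local explanations.
for j in range(X.shape[1]):
    r = np.corrcoef(shap_values[:, j], cfc[:, j])[0, 1]
    print(f"feature {j}: corr(SHAP, CFC) = {r:.3f}")
```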
arXiv Detail & Related papers (2021-08-13T12:41:39Z)
- Eigen Analysis of Self-Attention and its Reconstruction from Partial Computation
We study the global structure of attention scores computed using dot-product based self-attention.
We find that most of the variation among attention scores lies in a low-dimensional eigenspace.
We propose to compute scores only for a partial subset of token pairs, and use them to estimate scores for the remaining pairs.
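To make the low-dimensional claim concrete, here is a small synthetic sketch (not the paper's setup): it builds a score matrix from token representations that share a low-rank latent structure, mimicking what trained models exhibit, and measures how much variation the top singular directions capture. A partial-computation estimator would exploit exactly this structure.

```python
import numpy as np

# Synthetic stand-in for trained-model attention: queries and keys that
# share a k-dimensional latent structure plus a little noise.
rng = np.random.default_rng(0)
n, d, k = 128, 64, 8
Z = rng.normal(size=(n, k))                              # shared latent factors
Q = Z @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
K = Z @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
S = Q @ K.T / np.sqrt(d)                                 # pre-softmax attention scores

# How much of the score variation do the top-r singular directions capture?
sv = np.linalg.svd(S, compute_uv=False)
energy = np.cumsum(sv ** 2) / np.sum(sv ** 2)
for r in (4, 8, 16):
    print(f"rank {r}: {energy[r - 1]:.1%} of score variation")
```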
arXiv Detail & Related papers (2021-06-16T14:38:42Z)
- Optimal trees selection for classification via out-of-bag assessment and sub-bagging
The predictive performance of tree-based machine learning methods generally improves at a decreasing rate as the size of the training data increases.
We investigate this in the optimal trees ensemble (OTE), where the method fails to learn from some of the training observations due to internal validation.
Modified tree selection methods are thus proposed for OTE to cater for the loss of training observations in internal validation.
arXiv Detail & Related papers (2020-12-30T19:44:11Z)
- JSRT: James-Stein Regression Tree
The regression tree (RT) has been widely used in the machine learning and data mining communities.
In practice, the performance of RT relies heavily on the local mean of samples from an individual node during the tree construction/prediction stage.
We propose a novel regression tree, the James-Stein Regression Tree (JSRT), which incorporates global information from different nodes.
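The James-Stein idea itself is textbook: shrink many noisy node means toward their grand mean. The sketch below shows the classic positive-part shrinkage as an illustration of borrowing global information across nodes; it is not JSRT's exact estimator, and `var_of_means` (an assumed common variance of each node mean) is an illustrative input.

```python
import numpy as np

def james_stein_shrink(node_means, var_of_means):
    """Positive-part James-Stein shrinkage of k >= 4 node means toward
    their grand mean, assuming each mean has the same known variance.
    Illustrative of JSRT's use of global information, not its exact rule."""
    means = np.asarray(node_means, dtype=float)
    k = means.size
    grand = means.mean()
    resid = means - grand
    # Shrinkage factor; the (k - 3) term arises because we shrink toward
    # an estimated, not fixed, grand mean.
    factor = max(0.0, 1.0 - (k - 3) * var_of_means / np.sum(resid ** 2))
    return grand + factor * resid

# Usage: noisy per-leaf means pulled toward their common mean.
print(james_stein_shrink([2.1, 1.9, 3.5, 0.4, 2.8], var_of_means=0.5))
```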
arXiv Detail & Related papers (2020-10-18T16:28:49Z)
- A General Method for Robust Learning from Batches
We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains.
We derive the first robust computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and Gaussian-mixture distribution estimation.
arXiv Detail & Related papers (2020-02-25T18:53:25Z)
- Fréchet random forests for metric space valued regression with non euclidean predictors
We introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees.
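For orientation, the central object here is the Fréchet mean, the metric-space generalization of the average. The medoid-style sketch below is an illustrative implementation of the standard definition, not the paper's code.

```python
import numpy as np

def frechet_mean(points, dist, candidates=None):
    """Fréchet mean: the candidate minimizing the sum of squared
    distances to the sample. Restricting candidates to the sample
    itself gives a medoid-style approximation that works in any
    metric space."""
    cands = points if candidates is None else candidates
    costs = [sum(dist(c, x) ** 2 for x in points) for c in cands]
    return cands[int(np.argmin(costs))]

# Usage: with the absolute-value metric on the reals, the minimizer over
# the sample is the point closest to the ordinary mean (here 2.5).
pts = [1.0, 2.0, 2.5, 10.0]
print(frechet_mean(pts, lambda a, b: abs(a - b)))
```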
arXiv Detail & Related papers (2019-06-04T22:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.