Theoretical Investigation on Inductive Bias of Isolation Forest
- URL: http://arxiv.org/abs/2505.12825v2
- Date: Fri, 03 Oct 2025 11:09:03 GMT
- Title: Theoretical Investigation on Inductive Bias of Isolation Forest
- Authors: Qin-Cheng Zheng, Shao-Qun Zhang, Shen-Huan Lyu, Yuan Jiang, Zhi-Hua Zhou
- Abstract summary: Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector. Despite its widespread adoption, a theoretical foundation explaining iForest's success remains unclear. This paper focuses on the inductive bias of iForest, which theoretically elucidates under what circumstances and to what extent iForest works well.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector, primarily owing to its remarkable runtime efficiency and superior performance in large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest's success remains unclear. This paper focuses on the inductive bias of iForest, which theoretically elucidates under what circumstances and to what extent iForest works well. The key is to formulate the growth process of iForest, where the split dimensions and split values are randomly selected. We model the growth process of iForest as a random walk, enabling us to derive the expected depth function, which is the outcome of iForest, using transition probabilities. The case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to $k$-Nearest Neighbor. Our study provides a theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.
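To make the growth process in the abstract concrete, the sketch below is an illustrative reimplementation of the standard iForest mechanism, not the authors' analysis code: the split dimension and split value are drawn uniformly at random, a point's depth in each tree is its isolation signal, and depths are normalized by the usual expected path length c(n) to produce an anomaly score. Parameter choices (256 samples, 100 trees, depth cap 10) are illustrative defaults, not values taken from the paper.

```python
import math
import random

def c(n):
    # Expected path length of an unsuccessful BST search over n points,
    # used in the standard iForest formulation to normalize depths:
    # c(n) = 2 * H(n-1) - 2 * (n-1) / n, with H(i) ~ ln(i) + Euler's gamma.
    if n <= 1:
        return 0.0
    h = math.log(n - 1) + 0.5772156649
    return 2.0 * h - 2.0 * (n - 1) / n

def grow(points, depth=0, max_depth=10):
    # Growth process described in the abstract: both the split dimension
    # and the split value are selected uniformly at random.
    if len(points) <= 1 or depth >= max_depth:
        return ("leaf", len(points))
    dim = random.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = random.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            grow(left, depth + 1, max_depth),
            grow(right, depth + 1, max_depth))

def path_length(tree, x, depth=0):
    # Depth at which x lands; leaves holding several points get the
    # standard c(size) correction for the subtree that was not grown.
    if tree[0] == "leaf":
        return depth + c(tree[1])
    _, dim, split, left, right = tree
    child = left if x[dim] < split else right
    return path_length(child, x, depth + 1)

def anomaly_score(forest, x, n):
    # Shorter expected depth -> easier to isolate -> score closer to 1.
    e_h = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-e_h / c(n))

random.seed(0)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(256)]
forest = [grow(data) for _ in range(100)]
```

Averaging path lengths over many trees is what makes the expected depth function, which the paper derives in closed form via random-walk transition probabilities, the natural object of study: a far-away point such as (6, 6) is isolated after few random splits and scores noticeably higher than a point near the data center.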
Related papers
- Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior. This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias. Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z) - Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces [31.37944377681284]
We use PITA, a dataset of over 23 million statements in propositional logic and their corresponding proofs. We find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks.
arXiv Detail & Related papers (2026-02-16T02:20:37Z) - Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning [53.58654277639939]
In-context exploration is the intrinsic ability to generate, verify, and refine hypotheses within a single continuous context. We propose Length-Incentivized Exploration, which explicitly encourages models to explore more. Our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
arXiv Detail & Related papers (2026-02-12T09:24:32Z) - Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis [57.614436689939986]
Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. We recast their frameworks through the lens of Optimal Control and prove that the cost function of the Diffusion Bridge is lower. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer.
arXiv Detail & Related papers (2025-09-29T09:45:22Z) - Positional Biases Shift as Inputs Approach Context Window Limits [57.00239097102958]
The LiM effect is strongest when inputs occupy up to 50% of a model's context window. We observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input.
arXiv Detail & Related papers (2025-08-10T20:40:24Z) - Open-ended Scientific Discovery via Bayesian Surprise [63.26412847240136]
AutoDS is a method for open-ended scientific discovery that instead drives scientific exploration using Bayesian surprise. We evaluate AutoDS in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science.
arXiv Detail & Related papers (2025-06-30T22:53:59Z) - Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better [63.567886330598945]
Infrared small target (IRST) detection is challenging in simultaneously achieving precise, universal, robust and efficient performance. Current learning-based methods attempt to leverage "more" information from both the spatial and the short-term temporal domains. We propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection.
arXiv Detail & Related papers (2025-06-15T08:19:32Z) - Long-term Causal Inference via Modeling Sequential Latent Confounding [49.64731441006396]
Long-term causal inference is an important but challenging problem across various scientific domains. We propose an approach based on the Conditional Additive Equi-Confounding Bias (CAECB) assumption. Our proposed assumption states a functional relationship between sequential confounding biases across temporal short-term outcomes.
arXiv Detail & Related papers (2025-02-26T09:56:56Z) - Towards Understanding Extrapolation: a Causal Lens [53.15488984371969]
We provide a theoretical understanding of when extrapolation is possible and offer principled methods to achieve it. Under this formulation, we cast the extrapolation problem into a latent-variable identification problem. Our theory reveals the intricate interplay between the underlying manifold's smoothness and the shift properties.
arXiv Detail & Related papers (2025-01-15T21:29:29Z) - A Central Limit Theorem for the permutation importance measure [0.44998333629984877]
We provide a formal proof of a Central Limit Theorem for RFPIM using U-Statistics theory. Our result aims at improving the theoretical understanding of RFPIM rather than conducting comprehensive hypothesis testing.
arXiv Detail & Related papers (2024-12-17T15:40:21Z) - Bayesian Intervention Optimization for Causal Discovery [23.51328013481865]
Causal discovery is crucial for understanding complex systems and informing decisions.
Current methods, such as Bayesian and graph-theoretical approaches, do not prioritize decision-making.
We propose a novel Bayesian optimization-based method inspired by Bayes factors.
arXiv Detail & Related papers (2024-06-16T12:45:44Z) - Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z) - OptIForest: Optimal Isolation Forest for Anomaly Detection [19.38817835115542]
A category based on the isolation forest mechanism stands out due to its simplicity, effectiveness, and efficiency.
In this paper, we establish a theory on isolation efficiency and determine the optimal branching factor for an isolation tree.
Based on this theoretical underpinning, we design OptIForest, a practical optimal isolation forest incorporating clustering-based learning to hash.
arXiv Detail & Related papers (2023-06-22T07:14:02Z) - Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis [56.84237932819403]
This paper aims to estimate and mitigate the bad effect of textual modality for strong OOD generalization.
Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis.
arXiv Detail & Related papers (2022-07-24T03:57:40Z) - FACT: High-Dimensional Random Forests Inference [4.941630596191806]
Quantifying the usefulness of individual features in random forests learning can greatly enhance its interpretability.
Existing studies have shown that some popularly used feature importance measures for random forests suffer from the bias issue.
We propose a framework of the self-normalized feature-residual correlation test (FACT) for evaluating the significance of a given feature.
arXiv Detail & Related papers (2022-07-04T19:05:08Z) - Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence [45.9949173746044]
We show that large-size pre-trained language models (PLMs) do not satisfy the logical negation property (LNP).
We propose a novel intermediate training task, named meaning-matching, designed to directly learn a meaning-text correspondence.
We find that the task enables PLMs to learn lexical semantic information.
arXiv Detail & Related papers (2022-05-08T08:37:36Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z) - Interpretable Anomaly Detection with DIFFI: Depth-based Isolation Forest Feature Importance [4.769747792846005]
Anomaly Detection is an unsupervised learning task aimed at detecting anomalous behaviours with respect to historical data.
The Isolation Forest is one of the most commonly adopted algorithms in the field of Anomaly Detection.
This paper proposes methods to define feature importance scores at both global and local level for the Isolation Forest.
arXiv Detail & Related papers (2020-07-21T22:19:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.