Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box
- URL: http://arxiv.org/abs/2507.19455v1
- Date: Fri, 25 Jul 2025 17:41:39 GMT
- Title: Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box
- Authors: Lisa Barros de Andrade e Sousa, Gregor Miller, Ronan Le Gleut, Dominik Thalmeier, Helena Pelin, Marie Piraud,
- Abstract summary: We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in Random Forests by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns.
- Score: 0.6652172511473786
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
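The idea of grouping instances by shared decision paths can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the clustering algorithm, so both are assumptions here: similarity is taken as the fraction of trees in which two samples fall into the same leaf, and grouping is done by linking pairs whose similarity meets a threshold. The function names, the threshold, and the toy leaf indices are hypothetical; the input is shaped like the output of scikit-learn's `RandomForestClassifier.apply` (one leaf index per sample and tree).

```python
from itertools import combinations


def proximity_matrix(leaf_ids):
    """Fraction of trees in which each pair of samples lands in the same leaf.

    leaf_ids[i][t] is the leaf index of sample i in tree t (assumed input format).
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def group_by_proximity(prox, threshold=0.5):
    """Label connected components of the graph linking pairs with proximity >= threshold.

    This is a simple stand-in for a clustering step, not the paper's method.
    """
    n = len(prox)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and prox[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy leaf assignments: 5 samples, 4 trees (made-up values for illustration).
leaf_ids = [
    [3, 7, 2, 5],
    [3, 7, 2, 6],
    [3, 8, 2, 5],
    [9, 1, 4, 0],
    [9, 1, 4, 2],
]
prox = proximity_matrix(leaf_ids)
labels = group_by_proximity(prox, threshold=0.5)
print(labels)
```

In this toy input the first three samples share most of their leaves and form one group, while the last two form another. With a trained forest, the leaf indices would come from the model rather than being written by hand.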
Related papers
- ReDiSC: A Reparameterized Masked Diffusion Model for Scalable Node Classification with Structured Predictions [64.17845687013434]
We propose ReDiSC, a structured diffusion model for structured node classification. We show that ReDiSC achieves superior or highly competitive performance compared to state-of-the-art GNN, label propagation, and diffusion-based baselines. Notably, ReDiSC scales effectively to large-scale datasets on which previous structured diffusion methods fail due to computational constraints.
arXiv Detail & Related papers (2025-07-19T04:46:53Z) - Learning Decision Trees as Amortized Structure Inference [59.65621207449269]
We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks.
arXiv Detail & Related papers (2025-03-10T07:05:07Z) - Interaction-Aware Gaussian Weighting for Clustered Federated Learning [58.92159838586751]
Federated Learning (FL) emerged as a decentralized paradigm to train models while preserving privacy. We propose a novel clustered FL method, FedGWC (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution. Our experiments on benchmark datasets show that FedGWC outperforms existing FL algorithms in cluster quality and classification accuracy.
arXiv Detail & Related papers (2025-02-05T16:33:36Z) - DeCaf: A Causal Decoupling Framework for OOD Generalization on Node Classification [14.96980804513399]
Graph Neural Networks (GNNs) are susceptible to distribution shifts, creating vulnerability and security issues in critical domains.
Existing methods that target learning an invariant (feature, structure)-label mapping often depend on oversimplified assumptions about the data generation process.
We introduce a more realistic graph data generation model using Structural Causal Models (SCMs).
We propose a causal decoupling framework, DeCaf, that independently learns unbiased feature-label and structure-label mappings.
arXiv Detail & Related papers (2024-10-27T00:22:18Z) - Federated unsupervised random forest for privacy-preserving patient stratification [0.4499833362998487]
We introduce a novel multi-omics clustering approach utilizing unsupervised random-forests.
We have validated our approach on machine learning benchmark data sets and on cancer data from The Cancer Genome Atlas.
Our method is competitive with the state-of-the-art in terms of disease subtyping, but at the same time substantially improves the cluster interpretability.
arXiv Detail & Related papers (2024-01-29T12:04:14Z) - Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods only assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z) - FSAR: Federated Skeleton-based Action Recognition with Adaptive Topology Structure and Knowledge Distillation [23.0771949978506]
Existing skeleton-based action recognition methods typically follow a centralized learning paradigm, which can pose privacy concerns when exposing human-related videos.
We introduce a novel Federated Skeleton-based Action Recognition (FSAR) paradigm, which enables the construction of a globally generalized model without accessing local sensitive data.
arXiv Detail & Related papers (2023-06-19T16:18:14Z) - Learning for Transductive Threshold Calibration in Open-World Recognition [83.35320675679122]
We introduce OpenGCN, a Graph Neural Network-based transductive threshold calibration method with enhanced robustness and adaptability.
Experiments across open-world visual recognition benchmarks validate OpenGCN's superiority over existing posthoc calibration methods for open-world threshold calibration.
arXiv Detail & Related papers (2023-05-19T23:52:48Z) - Chaos to Order: A Label Propagation Perspective on Source-Free Domain Adaptation [8.27771856472078]
We present Chaos to Order (CtO), a novel approach for source-free domain adaptation (SFDA).
CtO strives to constrain semantic credibility and propagate label information among target subpopulations.
Empirical evidence demonstrates that CtO outperforms the state of the art on three public benchmarks.
arXiv Detail & Related papers (2023-01-20T03:39:35Z) - Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization [89.73665256847858]
We show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts.
Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet.
We also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS.
arXiv Detail & Related papers (2021-07-09T19:48:23Z) - Cross-Cluster Weighted Forests [4.9873153106566575]
This article considers the effect of ensembling Random Forest learners trained on clusters within a dataset with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm.
arXiv Detail & Related papers (2021-05-17T04:58:29Z) - Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) aims to learn classification models that make predictions for unlabeled data on a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)