Cross-Cluster Weighted Forests
- URL: http://arxiv.org/abs/2105.07610v1
- Date: Mon, 17 May 2021 04:58:29 GMT
- Title: Cross-Cluster Weighted Forests
- Authors: Maya Ramchandran, Rajarshi Mukherjee, and Giovanni Parmigiani
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting machine learning algorithms to better handle the presence of natural
clustering or batch effects within training datasets is imperative across a
wide variety of biological applications. This article considers the effect of
ensembling Random Forest learners trained on clusters within a single dataset
with heterogeneity in the distribution of the features. We find that
constructing ensembles of forests trained on clusters determined by algorithms
such as k-means results in significant improvements in accuracy and
generalizability over the traditional Random Forest algorithm. We denote our
novel approach as the Cross-Cluster Weighted Forest, and examine its robustness
to various data-generating scenarios and outcome models. Furthermore, we
explore the influence of the data-partitioning and ensemble weighting
strategies on conferring the benefits of our method over the existing paradigm.
Finally, we apply our approach to cancer molecular profiling and gene
expression datasets that are naturally divisible into clusters and illustrate
that our approach outperforms classic Random Forest. Code and supplementary
material are available at https://github.com/m-ramchandran/cross-cluster.
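The pipeline the abstract describes, partitioning the training data with k-means, fitting one Random Forest per cluster, and combining the per-cluster forests into a weighted ensemble, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform weighting used here is a simplifying assumption (the paper studies more refined ensemble-weighting strategies), and the helper names `cross_cluster_forest` and `predict_weighted` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

def cross_cluster_forest(X, y, n_clusters=3, seed=0):
    """Partition (X, y) with k-means and fit one forest per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    forests = []
    for c in range(n_clusters):
        mask = labels == c
        rf = RandomForestRegressor(n_estimators=50, random_state=seed)
        rf.fit(X[mask], y[mask])
        forests.append(rf)
    return forests

def predict_weighted(forests, X_new, weights=None):
    """Weighted average of the per-cluster forests' predictions."""
    preds = np.stack([rf.predict(X_new) for rf in forests])
    if weights is None:  # uniform weights: a simplifying assumption
        weights = np.full(len(forests), 1.0 / len(forests))
    return weights @ preds

# Tiny synthetic dataset with three well-separated feature clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(60, 2)) for m in (0.0, 5.0, 10.0)])
y = X[:, 0] + X[:, 1]
forests = cross_cluster_forest(X, y)
print(predict_weighted(forests, X[:5]).shape)  # (5,)
```

Replacing the uniform `weights` with weights learned on held-out data (e.g. by stacking) is where the method's reported gains over a single pooled Random Forest would come in.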
Related papers
- Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo Methods [9.035959289139102]
Mixed effects neural networks (MENNs) separate cluster-specific 'random effects' from cluster-invariant 'fixed effects'.
We present MC-GMENN, a novel approach employing Monte Carlo methods to train Generalized Mixed Effects Neural Networks.
arXiv Detail & Related papers (2024-07-01T09:24:04Z)
- Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping [0.24578723416255746]
Feature selection assumes a pivotal role in enhancing model interpretability.
The accuracy gained from aggregating decision trees comes at the expense of interpretability.
The study introduces novel methods to construct feature graphs from unsupervised random forests.
arXiv Detail & Related papers (2024-04-27T12:47:37Z)
- GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure.
First, we develop a discriminative feature alignment mechanism to discover intrinsic relationships across real and generated samples.
Second, we design a self-supervised metric learning scheme to generate more reliable cluster assignments.
arXiv Detail & Related papers (2024-04-14T01:51:11Z)
- Federated unsupervised random forest for privacy-preserving patient stratification [0.4499833362998487]
We introduce a novel multi-omics clustering approach utilizing unsupervised random forests.
We have validated our approach on machine learning benchmark data sets and on cancer data from The Cancer Genome Atlas.
Our method is competitive with the state-of-the-art in terms of disease subtyping, but at the same time substantially improves the cluster interpretability.
arXiv Detail & Related papers (2024-01-29T12:04:14Z)
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
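The merge rule summarized above can be sketched in a few lines: at each step, merge the pair of clusters whose cross-cluster points have the maximum average dot product, instead of the minimum distance used by standard linkage. This is an illustrative reading of the rule only, not the paper's implementation; the function name `merge_by_avg_dot` and the stopping criterion are assumptions for the example.

```python
import numpy as np

def merge_by_avg_dot(X, target_clusters=2):
    """Agglomerative merging by maximum average pairwise dot product."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > target_clusters:
        best, best_score = None, -np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average dot product over all cross-cluster point pairs
                score = np.mean(X[clusters[a]] @ X[clusters[b]].T)
                if score > best_score:
                    best, best_score = (a, b), score
        a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# Two groups of near-parallel vectors: high within-group dot products
X = np.array([[2.0, 0.1], [1.9, 0.0], [0.0, 2.0], [0.1, 1.8]])
print(merge_by_avg_dot(X))  # [[0, 1], [2, 3]]
```

Because dot products reward vectors that point the same way, this variant groups directionally similar points even when Euclidean distances would suggest different merges.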
arXiv Detail & Related papers (2023-05-24T11:05:12Z)
- Unsupervised Clustered Federated Learning in Complex Multi-source Acoustic Environments [75.8001929811943]
We introduce a realistic and challenging, multi-source and multi-room acoustic environment.
We present an improved clustering control strategy that takes into account the variability of the acoustic scene.
The proposed approach is optimized using clustering-based measures and validated via a network-wide classification task.
arXiv Detail & Related papers (2021-06-07T14:51:39Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Unsupervised Multi-view Clustering by Squeezing Hybrid Knowledge from Cross View and Each View [68.88732535086338]
This paper proposes a new multi-view clustering method, low-rank subspace multi-view clustering based on adaptive graph regularization.
Experimental results for five widely used multi-view benchmarks show that our proposed algorithm surpasses other state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2020-08-23T08:25:06Z)
- Siloed Federated Learning for Multi-Centric Histopathology Datasets [0.17842332554022694]
This paper proposes a novel federated learning approach for deep learning architectures in the medical domain.
Local-statistic batch normalization (BN) layers are introduced, resulting in collaboratively-trained, yet center-specific models.
We benchmark the proposed method on the classification of tumorous histopathology image patches extracted from the Camelyon16 and Camelyon17 datasets.
arXiv Detail & Related papers (2020-08-17T15:49:30Z)
- Elastic Coupled Co-clustering for Single-Cell Genomic Data [0.0]
Single-cell technologies have enabled us to profile genomic features at unprecedented resolution.
Data integration can potentially lead to a better performance of clustering algorithms.
In this work, we formulate the problem in an unsupervised transfer learning framework.
arXiv Detail & Related papers (2020-03-29T08:21:53Z)
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
On a set of 16 data tables generated by a quasi-Monte Carlo experiment, one of the aggregation criteria, using L1 dissimilarity, is compared against hierarchical clustering and a k-means variant, partitioning around medoids (PAM).
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.