Weighted Sum-of-Trees Model for Clustered Data
- URL: http://arxiv.org/abs/2602.02931v1
- Date: Tue, 03 Feb 2026 00:04:49 GMT
- Title: Weighted Sum-of-Trees Model for Clustered Data
- Authors: Kevin McCoy, Zachary Wooten, Katarzyna Tomczak, Christine B. Peterson,
- Abstract summary: We propose a lightweight sum-of-trees model in which we learn a decision tree for each sample group. We show our model outperforms traditional decision trees and random forests in a variety of simulation settings. We showcase our method on real-world data from the sarcoma cohort of The Cancer Genome Atlas.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustered data, which arise when observations are nested within groups, are incredibly common in clinical, education, and social science research. Traditionally, a linear mixed model, which includes random effects to account for within-group correlation, would be used to model the observed data and make new predictions on unseen data. Some work has been done to extend the mixed model approach beyond linear regression into more complex and non-parametric models, such as decision trees and random forests. However, existing methods are limited to using the global fixed effects for prediction on data from out-of-sample groups, effectively assuming that all clusters share a common outcome model. We propose a lightweight sum-of-trees model in which we learn a decision tree for each sample group. We combine the predictions from these trees using weights so that out-of-sample group predictions are more closely aligned with the most similar groups in the training data. This strategy also allows for inference on the similarity across groups in the outcome prediction model, as the unique tree structures and variable importances for each group can be directly compared. We show our model outperforms traditional decision trees and random forests in a variety of simulation settings. Finally, we showcase our method on real-world data from the sarcoma cohort of The Cancer Genome Atlas, where patient samples are grouped by sarcoma subtype.
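The abstract describes fitting one decision tree per training group and combining the trees' predictions with weights that favor the training groups most similar to a new, unseen group. The sketch below illustrates that general idea only. The abstract does not say how the weights are computed, so the centroid-distance softmax used here, along with every function name, the `temperature` parameter, and the toy data, is a hypothetical assumption for illustration and not the authors' method.

```python
import math
from statistics import mean


def fit_tree(X, y, depth=2, min_leaf=2):
    """Fit a tiny regression tree by greedy squared-error splitting."""
    node = {"value": mean(y)}
    if depth == 0 or len(y) < 2 * min_leaf:
        return node
    best = None  # (sse, feature index, threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:
        return node
    _, j, t = best
    left_rows = [(row, yi) for row, yi in zip(X, y) if row[j] <= t]
    right_rows = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
    node["feature"], node["threshold"] = j, t
    node["left"] = fit_tree([r for r, _ in left_rows], [v for _, v in left_rows], depth - 1, min_leaf)
    node["right"] = fit_tree([r for r, _ in right_rows], [v for _, v in right_rows], depth - 1, min_leaf)
    return node


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]


def centroid(X):
    """Per-feature mean of a group's covariates."""
    return [mean(col) for col in zip(*X)]


def fit_group_trees(groups, depth=2):
    """groups maps a group name to (X, y); returns name -> (tree, centroid)."""
    return {g: (fit_tree(X, y, depth), centroid(X)) for g, (X, y) in groups.items()}


def similarity_weights(model, X_new, temperature=1.0):
    """ASSUMPTION: softmax over negative centroid distances; not from the paper."""
    c_new = centroid(X_new)
    scores = {g: -math.dist(c_new, c) / temperature for g, (_, c) in model.items()}
    top = max(scores.values())
    expd = {g: math.exp(s - top) for g, s in scores.items()}
    total = sum(expd.values())
    return {g: e / total for g, e in expd.items()}


def predict_weighted(model, X_new, temperature=1.0):
    """Weighted sum of the per-group tree predictions for each new row."""
    w = similarity_weights(model, X_new, temperature)
    return [sum(w[g] * predict_tree(tree, x) for g, (tree, _) in model.items()) for x in X_new]


# Toy data (made up): two groups with different outcome models.
groups = {
    "A": ([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]], [0, 0, 0, 1, 1, 1]),
    "B": ([[5.0], [5.1], [5.2], [5.8], [5.9], [6.0]], [10, 10, 10, 20, 20, 20]),
}
model = fit_group_trees(groups)
X_new = [[5.1], [5.9]]  # an unseen group whose covariates resemble group B
w = similarity_weights(model, X_new)
preds = predict_weighted(model, X_new)
print(w, preds)
```

In this toy setup the unseen group sits near group B in covariate space, so nearly all of the weight falls on B's tree and the predictions track B's outcome model rather than a pooled average. Keeping one tree per group also means the fitted split variables and thresholds can be compared across groups directly.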
Related papers
- Learning Decision Trees as Amortized Structure Inference [59.65621207449269]
We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks.
arXiv Detail & Related papers (2025-03-10T07:05:07Z)
- Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased [0.0]
We show that calibrating a random forest this way has negative consequences, including prevalence estimates. We make a surprising discovery: contrary to the widespread belief that decision trees are biased towards the majority class, they actually can be biased towards the minority class.
arXiv Detail & Related papers (2024-12-17T19:38:29Z)
- Robustly estimating heterogeneity in factorial data using Rashomon Partitions [4.76518127830168]
We propose a novel framework for model uncertainty called Rashomon Partition Sets (RPS). RPS consists of all models that have posterior density close to the maximum a posteriori (MAP) model. We give simulation evidence along with three empirical examples: price effects on charitable giving, heterogeneity in chromosomal structure, and the introduction of microfinance.
arXiv Detail & Related papers (2024-04-02T17:53:28Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Composite Feature Selection using Deep Ensembles [130.72015919510605]
We investigate the problem of discovering groups of predictive features without predefined grouping.
We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups.
We propose a new metric to measure similarity between discovered groups and the ground truth.
arXiv Detail & Related papers (2022-11-01T17:49:40Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
- Treeging [0.0]
Treeging combines the flexible mean structure of regression trees with the covariance-based prediction strategy of kriging into the base learner of an ensemble prediction algorithm.
We investigate the predictive accuracy of treeging across a thorough and widely varied battery of spatial and space-time simulation scenarios.
arXiv Detail & Related papers (2021-10-03T17:48:18Z)
- Cross-Cluster Weighted Forests [4.9873153106566575]
This article considers the effect of ensembling Random Forest learners trained on clusters within a dataset with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm.
arXiv Detail & Related papers (2021-05-17T04:58:29Z)
- Group Testing with a Graph Infection Spread Model [61.48558770435175]
Infection spreads via connections between individuals and this results in a probabilistic cluster formation structure as well as a non-i.i.d. infection status for individuals.
We propose a class of two-step sampled group testing algorithms where we exploit the known probabilistic infection spread model.
Our results imply that, by exploiting information on the connections of individuals, group testing can be used to reduce the number of required tests significantly even when infection rate is high.
arXiv Detail & Related papers (2021-01-14T18:51:32Z)
- Inference post Selection of Group-sparse Regression Models [2.1485350418225244]
Conditional inference provides a rigorous approach to counter bias when data from automated model selections is reused for inference.
We develop in this paper a statistically consistent Bayesian framework to assess uncertainties within linear models.
Finding wide applications when genes, proteins, genetic variants, neuroimaging measurements are grouped respectively by their biological pathways, molecular functions, regulatory regions, cognitive roles, these models are selected through a useful class of group-sparse learning algorithms.
arXiv Detail & Related papers (2020-12-31T15:43:26Z)
- Recommendations for Bayesian hierarchical model specifications for case-control studies in mental health [0.0]
Researchers must choose whether to assume all subjects are drawn from a common population, or to model them as deriving from separate populations.
We ran systematic simulations on synthetic multi-group behavioural data from a commonly used bandit task.
We found that fitting groups separately provided the most accurate and robust inference across all conditions.
arXiv Detail & Related papers (2020-11-03T14:19:59Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.