Group Probability-Weighted Tree Sums for Interpretable Modeling of
Heterogeneous Data
- URL: http://arxiv.org/abs/2205.15135v1
- Date: Mon, 30 May 2022 14:27:19 GMT
- Title: Group Probability-Weighted Tree Sums for Interpretable Modeling of
Heterogeneous Data
- Authors: Keyan Nasseri, Chandan Singh, James Duncan, Aaron Kornblith, Bin Yu
- Abstract summary: Group Probability-Weighted Tree Sums (G-FIGS) achieves state-of-the-art prediction performance on important clinical datasets.
G-FIGS increases specificity for identifying cervical spine injury by up to 10% over CART and up to 3% over FIGS alone.
All code, data, and models are released on Github.
- Score: 9.99624617629557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning in high-stakes domains, such as healthcare, faces two
critical challenges: (1) generalizing to diverse data distributions given
limited training data while (2) maintaining interpretability. To address these
challenges, we propose an instance-weighted tree-sum method that effectively
pools data across diverse groups to output a concise, rule-based model. Given
distinct groups of instances in a dataset (e.g., medical patients grouped by
age or treatment site), our method first estimates group membership
probabilities for each instance. Then, it uses these estimates as instance
weights in FIGS (Tan et al. 2022), to grow a set of decision trees whose values
sum to the final prediction. We call this new method Group Probability-Weighted
Tree Sums (G-FIGS). G-FIGS achieves state-of-the-art prediction performance on
important clinical datasets; e.g., holding the level of sensitivity fixed at
92%, G-FIGS increases specificity for identifying cervical spine injury by up
to 10% over CART and up to 3% over FIGS alone, with larger gains at higher
sensitivity levels. By keeping the total number of rules below 16 in FIGS, the
final models remain interpretable, and we find that their rules match medical
domain expertise. All code, data, and models are released on Github.
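The released code builds on FIGS from the imodels package; the snippet below is only a minimal sketch of the two-stage idea described above, written against scikit-learn with a plain decision tree standing in for the FIGS tree-sum (imodels' FIGSClassifier could be swapped in if it accepts sample weights). Function names and hyperparameters are illustrative, not the authors' implementation.

```python
# Sketch of the G-FIGS idea: (1) estimate group-membership probabilities from
# the covariates, (2) fit one interpretable model per group with each instance
# weighted by its estimated probability of belonging to that group.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier  # stand-in for the FIGS tree-sum

def fit_gfigs_like(X, y, groups, max_depth=3):
    """Fit the membership model and one instance-weighted model per group."""
    membership = LogisticRegression(max_iter=1000).fit(X, groups)
    proba = membership.predict_proba(X)              # shape: (n_samples, n_groups)
    models = {}
    for j, g in enumerate(membership.classes_):
        weights = proba[:, j]                        # soft group-membership weights
        models[g] = DecisionTreeClassifier(max_depth=max_depth).fit(
            X, y, sample_weight=weights)
    return membership, models

def predict_gfigs_like(models, X, groups):
    """At test time the group label is observed, so route to that group's model."""
    preds = np.zeros(len(X))
    for g, model in models.items():
        mask = np.asarray(groups) == g
        if mask.any():
            preds[mask] = model.predict_proba(X[mask])[:, 1]  # P(y = 1), binary outcome
    return preds
```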
Related papers
- Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy [0.9999629695552196]
The present work develops and validates a data-driven and interpretable machine-learning framework designed to predict strokes.
Ten routinely gathered demographic, lifestyle, and clinical variables were sourced from a public cohort of 4,981 records.
The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model.
arXiv Detail & Related papers (2025-05-18T21:46:45Z)
- Learning Decision Trees as Amortized Structure Inference [59.65621207449269]
We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data.
We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks.
arXiv Detail & Related papers (2025-03-10T07:05:07Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but individuals of non-European descent remain under-represented.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Multinomial belief networks for healthcare data [0.0]
We propose a deep generative model for augmenting sample sizes and modelling uncertainty.
We show that the model identifies meaningful clusters of DNA mutations in cancer and recovers meaningful signatures in a fully data-driven way.
arXiv Detail & Related papers (2023-11-28T16:12:50Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
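A rough sketch of this kind of outcome-aware stratification, assuming confidence statistics collected over training checkpoints; the thresholds, group names, and averaging scheme below are illustrative assumptions, not taken from the paper.

```python
# Stratify training examples by how confidently and consistently the model
# predicts their true label across checkpoints (illustrative thresholds).
import numpy as np

def stratify_examples(confidences, tau=0.2):
    """confidences: array of shape (n_checkpoints, n_examples) holding each
    example's predicted probability of its true label at several checkpoints."""
    mean_conf = confidences.mean(axis=0)
    # Bernoulli-style uncertainty of the prediction, averaged over checkpoints.
    uncertainty = (confidences * (1.0 - confidences)).mean(axis=0)
    groups = np.full(mean_conf.shape, "ambiguous", dtype=object)
    groups[(mean_conf >= 0.75) & (uncertainty < tau)] = "easy"
    groups[(mean_conf <= 0.25) & (uncertainty < tau)] = "hard"
    return groups
```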
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
- How Do Graph Networks Generalize to Large and Diverse Molecular Systems? [10.690849483282564]
We identify four aspects of complexity in which many datasets are lacking.
We propose the GemNet-OC model, which outperforms the previous state-of-the-art on OC20 by 16%.
Our findings challenge the common belief that graph neural networks work equally well independent of dataset size and diversity.
arXiv Detail & Related papers (2022-04-06T12:52:34Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show that combining recent results on equivariant representation learning over structured spaces with simple, classical results from causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Fast Interpretable Greedy-Tree Sums [8.268938983372452]
Fast Interpretable Greedy-Tree Sums (FIGS) generalizes the CART algorithm to grow a flexible number of trees in summation.
G-FIGS derives clinical decision instruments (CDIs) that reflect domain knowledge and enjoy improved specificity (by up to 20% over CART) without sacrificing sensitivity or interpretability.
Bagging-FIGS enjoys competitive performance with random forests and XGBoost on real-world datasets.
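As a loose illustration of the tree-sum form that FIGS (and hence G-FIGS) produces, the sketch below fits each new tree to the residual of the running sum; the real FIGS growth rule is different, deciding at every step whether to extend an existing tree or start a new one, so treat this only as a conceptual analogue.

```python
# Simplified tree-sum: each tree is fit to the residual of the running sum,
# and the final prediction is the sum of all tree outputs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_tree_sum(X, y, n_trees=3, max_leaf_nodes=5):
    trees, residual = [], np.asarray(y, dtype=float).copy()
    for _ in range(n_trees):
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residual)
        trees.append(t)
        residual -= t.predict(X)       # the next tree explains what is left over
    return trees

def predict_tree_sum(trees, X):
    return np.sum([t.predict(X) for t in trees], axis=0)
```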
arXiv Detail & Related papers (2022-01-28T04:50:37Z)
- Evaluation of data imputation strategies in complex, deeply-phenotyped data sets: the case of the EU-AIMS Longitudinal European Autism Project [0.0]
We evaluate different imputation strategies to fill in missing values in clinical data from a large (total N=764) dataset.
We consider a total of 160 clinical measures divided in 15 overlapping subsets of participants.
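A generic way to benchmark such imputation strategies with scikit-learn is sketched below; the particular imputers, masking scheme, and RMSE score are assumptions for illustration, not the protocol used in the EU-AIMS study.

```python
# Compare imputers by masking known entries and scoring how well each one
# reconstructs the held-out values.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

def imputation_rmse(imputer, X_complete, missing_mask):
    """X_complete: fully observed float array; missing_mask: boolean array of entries to hide."""
    X_missing = X_complete.copy()
    X_missing[missing_mask] = np.nan
    X_imputed = imputer.fit_transform(X_missing)
    return np.sqrt(np.mean((X_imputed[missing_mask] - X_complete[missing_mask]) ** 2))
```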
arXiv Detail & Related papers (2022-01-20T21:50:38Z)
- Generalizing electrocardiogram delineation: training convolutional neural networks with synthetic data augmentation [63.51064808536065]
Existing databases for ECG delineation are small, insufficient both in size and in the range of pathological conditions they represent.
This article has two main contributions. First, a pseudo-synthetic data generation algorithm was developed, based on probabilistically composing ECG traces from "pools" of fundamental segments cropped from the original databases, together with a set of rules for arranging them into coherent synthetic traces.
Second, two novel segmentation-based loss functions have been developed, which attempt to enforce the prediction of an exact number of independent structures and to produce closer segmentation boundaries by focusing on a reduced number of samples.
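A toy sketch of the segment-pool composition idea is given below; the segment types, ordering rule, and labelling are illustrative placeholders rather than the paper's actual generation algorithm.

```python
# Compose synthetic ECG traces by drawing fundamental segments from pools
# cropped out of real recordings and concatenating them in a fixed order.
import numpy as np

rng = np.random.default_rng(0)

def synth_beat(pools, order=("P", "QRS", "T", "baseline")):
    """pools: dict mapping segment type -> list of 1-D numpy arrays."""
    segments, labels = [], []
    for kind in order:
        seg = pools[kind][rng.integers(len(pools[kind]))]   # random pool member
        segments.append(seg)
        labels.append(np.full(len(seg), kind, dtype=object)) # per-sample label
    return np.concatenate(segments), np.concatenate(labels)

def synth_trace(pools, n_beats=8):
    beats = [synth_beat(pools) for _ in range(n_beats)]
    signal = np.concatenate([b[0] for b in beats])
    labels = np.concatenate([b[1] for b in beats])
    return signal, labels
```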
arXiv Detail & Related papers (2021-11-25T10:11:41Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
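MURAL's own splitting rules are designed for disparate variable types and missing values; as a rough off-the-shelf analogue, scikit-learn's RandomTreesEmbedding also builds randomized trees and uses leaf membership as an unsupervised embedding, as sketched below.

```python
# Unsupervised random-forest-style embedding: encode each sample by the leaves
# it falls into across many randomized trees, then reduce for visualization.
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.decomposition import TruncatedSVD

def random_forest_embedding(X, n_estimators=100, max_depth=5, n_components=2):
    leaves = RandomTreesEmbedding(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    ).fit_transform(X)                      # sparse one-hot leaf indicators
    return TruncatedSVD(n_components=n_components).fit_transform(leaves)
```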
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
- Cohort Bias Adaptation in Aggregated Datasets for Lesion Segmentation [0.8466401378239363]
We propose a generalized affine conditioning framework to learn and account for cohort biases across multi-source datasets.
We show that our cohort bias adaptation method improves performance of the network on pooled datasets.
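One common way to realize affine conditioning is a per-cohort scale-and-shift of intermediate features (FiLM-style); the minimal sketch below assumes that form and omits how the parameters are learned, so it should not be read as the paper's architecture.

```python
# Per-cohort affine conditioning of features: one learnable (scale, shift)
# pair per cohort, applied element-wise before further processing.
import numpy as np

class CohortAffine:
    def __init__(self, cohorts, n_features):
        # In a trained network these would be parameters updated by backprop.
        self.scale = {c: np.ones(n_features) for c in cohorts}
        self.shift = {c: np.zeros(n_features) for c in cohorts}

    def __call__(self, features, cohort):
        # Condition the features on the cohort the sample came from.
        return features * self.scale[cohort] + self.shift[cohort]
```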
arXiv Detail & Related papers (2021-08-02T08:32:57Z)
- Can x2vec Save Lives? Integrating Graph and Language Embeddings for Automatic Mental Health Classification [91.3755431537592]
I show how merging graph and language embedding models (metapath2vec and doc2vec) avoids resource limits.
When integrated, the combined data produce highly accurate predictions (90% accuracy, with 10% false positives and 12% false negatives).
These results extend research on the importance of simultaneously analyzing behavior and language in massive networks.
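A minimal sketch of fusing precomputed graph (metapath2vec-style) and text (doc2vec-style) embeddings by concatenation before classification; the summary does not specify the paper's exact fusion or classifier, so these choices are assumptions.

```python
# Early fusion of graph and language embeddings: concatenate per-user vectors
# and evaluate a simple classifier with cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fuse_and_classify(graph_emb, text_emb, labels):
    """graph_emb, text_emb: arrays with one row per user; labels: binary outcome."""
    X = np.hstack([graph_emb, text_emb])       # simple fusion by concatenation
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5, scoring="f1")
```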
arXiv Detail & Related papers (2020-01-04T20:56:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.