A Semi-supervised CART Model for Covariate Shift
- URL: http://arxiv.org/abs/2410.20978v2
- Date: Sun, 22 Dec 2024 10:41:27 GMT
- Title: A Semi-supervised CART Model for Covariate Shift
- Authors: Mingyang Cai, Thomas Klausch, Mark A. van de Wiel
- Abstract summary: This paper introduces a semi-supervised classification and regression tree (CART) that uses importance weighting to address distribution discrepancies.
Our method improves the predictive performance of the CART model by assigning greater weights to training samples that more accurately represent the target distribution.
Through simulation studies and applications to real-world medical data, we demonstrate significant improvements in predictive accuracy.
- Abstract: Machine learning models used in medical applications often face challenges due to covariate shift, which occurs when the distributions of training and target data differ. This can reduce predictive accuracy, especially when outcomes in the target data are unknown. This paper introduces a semi-supervised classification and regression tree (CART) that uses importance weighting to address these distribution discrepancies. Our method improves the predictive performance of the CART model by assigning greater weights to training samples that more accurately represent the target distribution, especially in cases of covariate shift without target outcomes. In addition to CART, we extend this weighted approach to generalized linear model trees and tree ensembles, creating a versatile framework for managing covariate shift in complex datasets. Through simulation studies and applications to real-world medical data, we demonstrate significant improvements in predictive accuracy. These findings suggest that our weighted approach can enhance reliability in medical applications and other fields where covariate shift poses challenges to model performance across various data distributions.
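The importance-weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the weight estimator (a one-dimensional logistic regression that separates labeled source covariates from unlabeled target covariates), the depth-one regression tree, the toy data, and every function name below are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def fit_domain_classifier(source_x, target_x, steps=2000, lr=0.5):
    """Fit P(target | x) = sigmoid(a * x + b) by batch gradient descent."""
    data = [(x, 0.0) for x in source_x] + [(x, 1.0) for x in target_x]
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, is_target in data:
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - is_target) * x
            grad_b += p - is_target
        a -= lr * grad_a / len(data)
        b -= lr * grad_b / len(data)
    return a, b


def importance_weights(source_x, target_x):
    """Estimate w(x) = p_target(x) / p_source(x) from the classifier odds."""
    a, b = fit_domain_classifier(source_x, target_x)
    size_ratio = len(source_x) / len(target_x)  # corrects for unequal sample sizes
    raw = [size_ratio * math.exp(a * x + b) for x in source_x]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]  # rescale so the weights average to 1


def weighted_sse(ys, ws):
    """Weighted sum of squared deviations from the weighted mean."""
    mean = sum(w * y for y, w in zip(ys, ws)) / sum(ws)
    return sum(w * (y - mean) ** 2 for y, w in zip(ys, ws))


def fit_stump(xs, ys, ws):
    """Depth-one regression tree: return the split minimising weighted SSE."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    ws = [ws[i] for i in order]
    best_loss, best_split = math.inf, None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # no valid split between identical covariate values
        loss = weighted_sse(ys[:k], ws[:k]) + weighted_sse(ys[k:], ws[k:])
        if loss < best_loss:
            best_loss, best_split = loss, (xs[k - 1] + xs[k]) / 2
    return best_split


# Toy data (hypothetical): source covariates follow N(0, 1), target covariates
# follow N(2, 1); outcomes are observed for the source sample only.
n = 200
quantiles = [(i + 0.5) / n for i in range(n)]
source_x = [NormalDist(0, 1).inv_cdf(q) for q in quantiles]
target_x = [NormalDist(2, 1).inv_cdf(q) for q in quantiles]
source_y = [0.0 if x < 0 else 1.0 if x < 2 else 3.0 for x in source_x]

weights = importance_weights(source_x, target_x)
unweighted_split = fit_stump(source_x, source_y, [1.0] * n)
weighted_split = fit_stump(source_x, source_y, weights)
print(f"unweighted split: {unweighted_split:.2f}, weighted split: {weighted_split:.2f}")
```

In this toy setup the unweighted stump should split at the first jump in the outcome, where most source data lie, while the weighted stump should move to the second jump, where the target data concentrate. A full CART would apply the same weighted criterion recursively.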
Related papers
- DeCaf: A Causal Decoupling Framework for OOD Generalization on Node Classification [14.96980804513399]
Graph Neural Networks (GNNs) are susceptible to distribution shifts, creating vulnerability and security issues in critical domains.
Existing methods that target learning an invariant (feature, structure)-label mapping often depend on oversimplified assumptions about the data generation process.
We introduce a more realistic graph data generation model using Structural Causal Models (SCMs).
We propose a causal decoupling framework, DeCaf, that independently learns unbiased feature-label and structure-label mappings.
arXiv Detail & Related papers (2024-10-27T00:22:18Z) - Generative Principal Component Regression via Variational Inference [2.4415762506639944]
One approach to designing appropriate manipulations is to target key features of predictive models.
We develop a novel objective based on supervised variational autoencoders (SVAEs) that ensures such information is represented in the latent space.
We show in simulations that gPCR dramatically improves target selection in manipulation as compared to standard PCR and SVAEs.
arXiv Detail & Related papers (2024-09-03T22:38:55Z) - Quantifying Distribution Shifts and Uncertainties for Enhanced Model Robustness in Machine Learning Applications [0.0]
This study explores model adaptation and generalization by utilizing synthetic data.
We employ quantitative measures such as Kullback-Leibler divergence, Jensen-Shannon distance, and Mahalanobis distance to assess data similarity.
Our findings suggest that utilizing statistical measures, such as the Mahalanobis distance, to determine whether model predictions fall within the low-error "interpolation regime" or the high-error "extrapolation regime" provides a complementary method for assessing distribution shift and model uncertainty.
arXiv Detail & Related papers (2024-05-03T10:05:31Z) - Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z) - Vector-Based Data Improves Left-Right Eye-Tracking Classifier Performance After a Covariate Distributional Shift [0.0]
We propose a fine-grain data approach for EEG-ET data collection in order to create more robust benchmarking.
We train machine learning models utilizing both coarse-grain and fine-grain data and compare their accuracies when tested on data of similar/different distributional patterns.
Results showed that models trained on fine-grain, vector-based data were less susceptible to distributional shifts than models trained on coarse-grain, binary-classified data.
arXiv Detail & Related papers (2022-07-31T16:27:50Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization [89.73665256847858]
We show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts.
Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet.
We also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS.
arXiv Detail & Related papers (2021-07-09T19:48:23Z) - Predicting with Confidence on Unseen Distributions [90.68414180153897]
We connect domain adaptation and predictive uncertainty literature to predict model accuracy on challenging unseen distributions.
We find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts.
We specifically investigate the distinction between synthetic and natural distribution shifts and observe that despite its simplicity DoC consistently outperforms other quantifications of distributional difference.
arXiv Detail & Related papers (2021-07-07T15:50:18Z) - Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
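As an illustration of the Average Thresholded Confidence (ATC) entry in the list above, the following is a minimal sketch of a thresholding rule of that kind: choose a threshold on labeled source data so that the share of confidences above it matches the source accuracy, then report the share of unlabeled target confidences above the same threshold. Using a single confidence score per example, the function names, and the toy numbers are assumptions for illustration, not code from that paper.

```python
def atc_threshold(source_conf, source_correct):
    """Pick t so the share of source confidences above t equals source accuracy."""
    ranked = sorted(source_conf, reverse=True)
    n_correct = sum(source_correct)
    if n_correct == 0:
        return ranked[0]  # no confidence is strictly above the maximum
    if n_correct == len(ranked):
        return float("-inf")  # every confidence is above the threshold
    # Midpoint between the n_correct-th and the next-highest confidence;
    # tied confidences at this boundary would make the match approximate.
    return (ranked[n_correct - 1] + ranked[n_correct]) / 2


def atc_predicted_accuracy(target_conf, threshold):
    """Predicted accuracy: fraction of target confidences above the threshold."""
    return sum(c > threshold for c in target_conf) / len(target_conf)


# Hypothetical confidences: labeled source data, then unlabeled target data.
source_conf = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55]
source_correct = [1, 1, 1, 1, 0, 0]
target_conf = [0.90, 0.85, 0.60, 0.50]

threshold = atc_threshold(source_conf, source_correct)
predicted = atc_predicted_accuracy(target_conf, threshold)
print(f"threshold: {threshold:.2f}, predicted target accuracy: {predicted:.2f}")
```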
This list is automatically generated from the titles and abstracts of the papers in this site.