Efficient Estimation and Evaluation of Prediction Rules in
Semi-Supervised Settings under Stratified Sampling
- URL: http://arxiv.org/abs/2010.09443v2
- Date: Sat, 25 Sep 2021 13:53:42 GMT
- Title: Efficient Estimation and Evaluation of Prediction Rules in
Semi-Supervised Settings under Stratified Sampling
- Authors: Jessica Gronsbell and Molei Liu and Lu Tian and Tianxi Cai
- Abstract summary: We propose a two-step semi-supervised learning (SSL) procedure for evaluating a prediction rule derived from a working binary regression model.
In step I, we impute the missing labels via weighted regression with nonlinear basis functions to account for nonrandom sampling.
In step II, we augment the initial imputations to ensure the consistency of the resulting estimators.
- Score: 6.930951733450623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many contemporary applications, large amounts of unlabeled data are
readily available while labeled examples are limited. There has been
substantial interest in semi-supervised learning (SSL) which aims to leverage
unlabeled data to improve estimation or prediction. However, current SSL
literature focuses primarily on settings where labeled data is selected
randomly from the population of interest. Non-random sampling, while posing
additional analytical challenges, is highly applicable to many real world
problems. Moreover, no SSL methods currently exist for estimating the
prediction performance of a fitted model under non-random sampling. In this
paper, we propose a two-step SSL procedure for evaluating a prediction rule
derived from a working binary regression model based on the Brier score and
overall misclassification rate under stratified sampling. In step I, we impute
the missing labels via weighted regression with nonlinear basis functions to
account for nonrandom sampling and to improve efficiency. In step II, we
augment the initial imputations to ensure the consistency of the resulting
estimators regardless of the specification of the prediction model or the
imputation model. The final estimator is then obtained with the augmented
imputations. We provide asymptotic theory and numerical studies illustrating
that our proposals outperform their supervised counterparts in terms of
efficiency gain. Our methods are motivated by electronic health records (EHR)
research and validated with a real data analysis of an EHR-based study of
diabetic neuropathy.
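To make the two-step procedure concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the stratified sampling weights are known, uses a spline expansion as the nonlinear basis, and writes step II as a standard augmented (AIPW-style) correction at the level of the loss rather than the paper's exact augmented imputations; all function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import SplineTransformer

def ssl_accuracy_estimates(X_all, p_all, labeled, y_lab, w_lab, threshold=0.5):
    """Semi-supervised estimates of the Brier score and overall
    misclassification rate of a prediction rule p(x) when labels are
    observed only on a stratified subsample.

    X_all   : features for the full cohort (labeled + unlabeled)
    p_all   : predicted probabilities from the fitted working model
    labeled : boolean mask marking the labeled observations
    y_lab   : observed binary labels (in the order of X_all[labeled])
    w_lab   : inverse-probability-of-sampling weights for the labeled subset
    """
    # --- Step I: weighted imputation with nonlinear basis functions ----
    basis = SplineTransformer(degree=3, n_knots=5)
    B_all = basis.fit_transform(X_all)
    imp = LogisticRegression(max_iter=1000)
    imp.fit(B_all[labeled], y_lab, sample_weight=w_lab)   # weights offset the
    m_hat = imp.predict_proba(B_all)[:, 1]                # non-random sampling

    y_pred = (p_all >= threshold).astype(float)

    # Both losses are linear in a binary Y:
    #   (Y - p)^2      = Y * (1 - 2p) + p^2
    #   1{Y != y_pred} = Y * (1 - 2*y_pred) + y_pred
    # so imputing E[Y | X] is enough to impute the losses themselves.
    brier_imp = m_hat * (1 - 2 * p_all) + p_all ** 2
    err_imp = m_hat * (1 - 2 * y_pred) + y_pred

    # --- Step II: augmentation for consistency -------------------------
    # An inverse-probability-weighted correction on the labeled subset keeps
    # the estimates consistent even if the imputation (or working) model is
    # misspecified; it shrinks toward zero when the imputations fit well.
    obs_brier = (y_lab - p_all[labeled]) ** 2
    obs_err = (y_lab != y_pred[labeled]).astype(float)
    w_norm = w_lab / np.sum(w_lab)

    brier_hat = np.mean(brier_imp) + np.sum(w_norm * (obs_brier - brier_imp[labeled]))
    err_hat = np.mean(err_imp) + np.sum(w_norm * (obs_err - err_imp[labeled]))
    return brier_hat, err_hat
```

The first term averages the imputed losses over the full cohort, which is where the efficiency gain comes from; the weighted correction term uses only the labeled subsample and protects the estimates against misspecification of the imputation or working model.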
Related papers
- Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data [8.619243141968886]
We present an inference framework for estimating regression coefficients in conditional mean models.
We develop an augmented inverse probability weighted (AIPW) method, employing regularized estimators for both the propensity score (PS) and outcome regression (OR) models (a generic AIPW sketch appears after this list).
Our theoretical findings are verified through extensive simulation studies and a real-world data application.
arXiv Detail & Related papers (2024-06-20T00:34:54Z)
- Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting [55.17761802332469]
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample.
Prior methods perform backpropagation for each test sample, incurring optimization costs that are prohibitive for many applications.
We propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples.
arXiv Detail & Related papers (2024-03-18T05:49:45Z)
- Calibrating doubly-robust estimators with unbalanced treatment assignment [0.0]
We propose a simple extension of the DML estimator which undersamples the data for propensity score modeling and calibrates the scores to match the original distribution.
The paper provides theoretical results showing that the modified estimator retains the properties of the standard DML estimator.
arXiv Detail & Related papers (2024-03-03T18:40:11Z)
- Taming Overconfident Prediction on Unlabeled Data from Hindsight [50.9088560433925]
Minimizing prediction uncertainty on unlabeled data is a key factor to achieve good performance in semi-supervised learning.
This paper proposes a dual mechanism, named ADaptive Sharpening (ADS), which first applies a soft-threshold to adaptively mask out determinate and negligible predictions, and then sharpens the remaining informed predictions.
Used as a plug-in, ADS significantly improves state-of-the-art SSL methods.
arXiv Detail & Related papers (2021-12-15T15:17:02Z)
- A comparison of approaches to improve worst-case predictive model performance over patient subpopulations [14.175321968797252]
Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations.
We seek to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations.
We find that, with relatively few exceptions, no approach performs better than standard learning procedures across all of the patient subpopulations examined.
arXiv Detail & Related papers (2021-08-27T13:10:00Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction [3.10560974227074]
We develop a surrogate-assisted semi-supervised learning (SAS) approach to risk modeling with high dimensional predictors.
We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high dimensional working model.
arXiv Detail & Related papers (2021-05-04T03:08:51Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research.
Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z)
- Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction [4.860671253873579]
We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process.
We analyze the effectiveness of our SSL approach in improving prediction performance.
arXiv Detail & Related papers (2020-09-01T17:55:51Z)
- Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)
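As background for the AIPW entry at the top of this list, the sketch below shows a generic doubly robust AIPW estimator of a mean with regularized nuisance models. It illustrates the general construction only, under illustrative model choices, and is not the high-dimensional inference procedure of that paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

def aipw_mean(X, y, observed):
    """Doubly robust (AIPW) estimate of E[Y] when y is only observed where
    `observed` is True; regularized working models for the propensity score
    (PS) and the outcome regression (OR)."""
    observed = observed.astype(bool)

    # PS model: P(observed = 1 | X), ridge-penalized logistic regression.
    ps = LogisticRegressionCV(cv=5, max_iter=2000).fit(X, observed.astype(int))
    pi_hat = np.clip(ps.predict_proba(X)[:, 1], 0.01, 0.99)  # avoid extreme weights

    # OR model: E[Y | X], lasso fitted on the observed subset only.
    orm = LassoCV(cv=5).fit(X[observed], y[observed])
    m_hat = orm.predict(X)

    # AIPW combination: OR prediction plus an IPW-weighted residual correction.
    correction = np.zeros_like(m_hat)
    correction[observed] = (y[observed] - m_hat[observed]) / pi_hat[observed]
    return float(np.mean(m_hat + correction))
```

The estimator is consistent if either the PS model or the OR model is correctly specified (double robustness); the paper above extends this idea to regression coefficients in conditional mean models with regularized, possibly misspecified nuisance models.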
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.