Efficient semi-supervised inference for logistic regression under case-control studies
- URL: http://arxiv.org/abs/2402.15365v1
- Date: Fri, 23 Feb 2024 14:55:58 GMT
- Title: Efficient semi-supervised inference for logistic regression under case-control studies
- Authors: Zhuojun Quan, Yuanyuan Lin, Kani Chen, Wen Yu
- Abstract summary: We consider an inference problem in semi-supervised settings where the outcome in the labeled data is binary.
Case-control sampling is an effective sampling scheme for alleviating the imbalanced structure of binary data.
We find that, with the availability of unlabeled data, the intercept parameter can be identified in the semi-supervised setting.
- Score: 3.5485531932219243
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Semi-supervised learning has received increasing attention in statistics
and machine learning. In semi-supervised learning settings, a labeled data set
with both outcomes and covariates and an unlabeled data set with covariates
only are collected. We consider an inference problem in semi-supervised
settings where the outcome in the labeled data is binary and the labeled data
is collected by case-control sampling. Case-control sampling is an effective
sampling scheme for alleviating the imbalanced structure of binary data. Under the
logistic model assumption, case-control data can still provide a consistent
estimator for the slope parameter of the regression model. However, the
intercept parameter is not identifiable. Consequently, the marginal case
proportion cannot be estimated from case-control data. We find that, with
the availability of the unlabeled data, the intercept parameter can be
identified in the semi-supervised setting. We construct the likelihood
function of the observed labeled and unlabeled data and obtain the maximum
likelihood estimator via an iterative algorithm. The proposed estimator is
shown to be consistent, asymptotically normal, and semiparametrically
efficient. Extensive simulation studies are conducted to show the finite-sample
performance of the proposed method. The results imply that the unlabeled data
not only helps to identify the intercept but also improves the estimation
efficiency of the slope parameter. Meanwhile, the marginal case proportion can
be estimated accurately by the proposed method.
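To make the identification argument concrete: by the classical case-control result (Prentice and Pyke, 1979), an ordinary logistic fit to the labeled case-control data estimates the slope consistently but returns an intercept shifted by log(n1/n0) - log(pi1/(1-pi1)), where n1 and n0 are the case and control counts and pi1 = P(Y=1) is the unknown marginal case proportion. The unlabeled covariates pin down pi1 as a function of the true intercept, so the shift equation has a single unknown. The sketch below solves it by one-dimensional root finding; it is a plug-in illustration of the identification idea, not the paper's iterative semiparametric MLE, and the function name and root-finding bracket are assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from sklearn.linear_model import LogisticRegression  # penalty=None needs sklearn >= 1.2

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def recover_intercept(X_lab, y_lab, X_unlab):
    """Hypothetical plug-in recovery of the logistic intercept and the
    marginal case proportion from case-control labeled data plus
    unlabeled covariates (y_lab assumed to be a 0/1 numpy array)."""
    # Step 1: naive logistic fit on the case-control sample. The slope is
    # consistent; the intercept absorbs the offset log(n1/n0) - logit(pi1).
    fit = LogisticRegression(penalty=None).fit(X_lab, y_lab)
    beta_hat, alpha_cc = fit.coef_.ravel(), fit.intercept_[0]
    n1 = y_lab.sum()
    n0 = len(y_lab) - n1

    # Step 2: the unlabeled data expresses pi1 as a function of the true
    # intercept alpha: pi1(alpha) = mean over unlabeled x of expit(alpha + x @ beta_hat).
    eta = X_unlab @ beta_hat

    def gap(alpha):
        pi1 = expit(alpha + eta).mean()
        return alpha + np.log(n1 / n0) - np.log(pi1 / (1.0 - pi1)) - alpha_cc

    # The bracket is an assumption; widen it if brentq reports no sign change.
    alpha_hat = brentq(gap, -20.0, 20.0)
    pi1_hat = expit(alpha_hat + eta).mean()
    return alpha_hat, beta_hat, pi1_hat
```

Unlike this plug-in correction, the paper's maximum likelihood estimator works with the joint likelihood of the labeled and unlabeled data, which is also what improves the efficiency of the slope estimate.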
Related papers
- Assumption-Lean Post-Integrated Inference with Negative Control Outcomes [0.0]
We introduce a robust post-integrated inference (PII) method that adjusts for latent heterogeneity using negative control outcomes.
Our method extends to projected direct effect estimands, accounting for hidden mediators, confounders, and moderators.
The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification.
arXiv Detail & Related papers (2024-10-07T12:52:38Z)
- Statistical inference for case-control logistic regression via integrating external summary data [8.369377566749202]
Case-control sampling is a commonly used retrospective sampling design to alleviate the imbalanced structure of binary data.
An empirical likelihood based approach is proposed to make inference for the logistic model by incorporating the internal case-control data and external information.
arXiv Detail & Related papers (2024-05-31T07:47:38Z)
- On semi-supervised estimation using exponential tilt mixture models [12.347498345854715]
Consider a semi-supervised setting with a labeled dataset of binary responses and predictors and an unlabeled dataset with only predictors.
For semi-supervised estimation, we develop further analysis and understanding of a statistical approach using exponential tilt mixture (ETM) models.
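For intuition on the ETM structure (a sketch under my reading of the summary, not the paper's notation): the case covariate density is an exponential tilt of the control density, f1(x) = f0(x) * exp(a + x @ b), and an unlabeled covariate follows the mixture (1 - pi) f0 + pi f1, so its log-likelihood contribution relative to the f0 baseline is the term summed below.

```python
import numpy as np

def etm_unlabeled_loglik(eta, pi):
    # eta = a + X_unlab @ b for hypothetical tilt parameters (a, b).
    # Mixture density: f0(x) * ((1 - pi) + pi * exp(eta)); the f0 factor
    # is profiled out nonparametrically in the full ETM likelihood.
    return np.log((1.0 - pi) + pi * np.exp(eta)).sum()
```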
arXiv Detail & Related papers (2023-11-14T19:53:26Z)
- Adaptive Negative Evidential Deep Learning for Open-set Semi-supervised Learning [69.81438976273866]
Open-set semi-supervised learning (Open-set SSL) considers a more practical scenario, where unlabeled data and test data contain new categories (outliers) not observed in labeled data (inliers).
We introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference.
We propose a novel adaptive negative optimization strategy, making EDL more tailored to the unlabeled dataset containing both inliers and outliers.
arXiv Detail & Related papers (2023-03-21T09:07:15Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Semi-Supervised Quantile Estimation: Robust and Efficient Inference in High Dimensional Settings [0.5735035463793009]
We consider quantile estimation in a semi-supervised setting, characterized by two available data sets.
We propose a family of semi-supervised estimators for the response quantile(s) based on the two data sets.
arXiv Detail & Related papers (2022-01-25T10:02:23Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds the threshold.
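As summarized, ATC reduces to a quantile rule. A minimal sketch assuming max-probability confidence (the paper also studies a negative-entropy score; the function name is hypothetical):

```python
import numpy as np

def atc_predict_accuracy(probs_src, y_src, probs_tgt):
    conf_src = probs_src.max(axis=1)                      # confidence score
    acc_src = (probs_src.argmax(axis=1) == y_src).mean()  # source accuracy
    # Threshold chosen so the fraction of source examples above it
    # matches the source accuracy: the (1 - acc) quantile of confidences.
    t = np.quantile(conf_src, 1.0 - acc_src)
    # Predicted target accuracy: fraction of unlabeled target examples
    # whose confidence exceeds the learned threshold.
    return (probs_tgt.max(axis=1) > t).mean()
```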
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data [15.696653979226113]
We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data.
We derive a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance.
Both theoretical and empirical results demonstrate the effectiveness of our method.
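A minimal sketch of the IPW construction for control (negative) subsampling, with all names illustrative; the paper's variance-minimizing choice of sampling probabilities is not reproduced here:

```python
import numpy as np
from scipy.optimize import minimize

def ipw_logistic(X, y, samp_prob, seed=0):
    """Hypothetical IPW logistic fit: keep every rare case, keep control i
    with probability samp_prob[i], and up-weight kept controls by
    1 / samp_prob[i] so the weighted score stays unbiased."""
    rng = np.random.default_rng(seed)
    keep = (y == 1) | (rng.random(len(y)) < samp_prob)
    w = np.where(y == 1, 1.0, 1.0 / samp_prob)[keep]
    Xs, ys = X[keep], y[keep]

    def neg_loglik(theta):
        eta = Xs @ theta
        # Weighted logistic log-likelihood: w * (y * eta - log(1 + e^eta)).
        return -(w * (ys * eta - np.logaddexp(0.0, eta))).sum()

    return minimize(neg_loglik, np.zeros(X.shape[1]), method="BFGS").x
```

Roughly speaking, the log odds correction in the title refers to the alternative of fitting the subsample unweighted with the log sampling probability entering as an offset, mirroring the intercept shift that case-control sampling induces in the main paper.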
arXiv Detail & Related papers (2021-10-25T15:37:22Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
- Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)