CEDAR: Communication Efficient Distributed Analysis for Regressions
- URL: http://arxiv.org/abs/2207.00306v1
- Date: Fri, 1 Jul 2022 09:53:44 GMT
- Title: CEDAR: Communication Efficient Distributed Analysis for Regressions
- Authors: Changgee Chang, Zhiqi Bu, Qi Long
- Abstract summary: There are growing interests about distributed learning over multiple EHRs databases without sharing patient-level data.
We propose a novel communication efficient method that aggregates the local optimal estimates, by turning the problem into a missing data problem.
We provide theoretical investigation for the properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses.
- Score: 9.50726756006467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Electronic health records (EHRs) offer great promises for advancing precision
medicine and, at the same time, present significant analytical challenges.
Particularly, it is often the case that patient-level data in EHRs cannot be
shared across institutions (data sources) due to government regulations and/or
institutional policies. As a result, there are growing interests about
distributed learning over multiple EHRs databases without sharing patient-level
data. To tackle such challenges, we propose a novel communication efficient
method that aggregates the local optimal estimates, by turning the problem into
a missing data problem. In addition, we propose incorporating posterior samples
of remote sites, which can provide partial information on the missing
quantities and improve efficiency of parameter estimates while having the
differential privacy property and thus reducing the risk of information
leaking. The proposed approach, without sharing the raw patient level data,
allows for proper statistical inference and can accommodate sparse regressions.
We provide theoretical investigation for the asymptotic properties of the
proposed method for statistical inference as well as differential privacy, and
evaluate its performance in simulations and real data analyses in comparison
with several recently developed methods.
Related papers
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z) - Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI)
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z) - Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z) - Multi-Source Conformal Inference Under Distribution Shift [41.701790856201036]
We consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources.
We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations.
We propose a data-adaptive strategy to upweight informative data sources for efficiency gain and downweight non-informative data sources for bias reduction.
arXiv Detail & Related papers (2024-05-15T13:33:09Z) - Reliable Generation of Privacy-preserving Synthetic Electronic Health Record Time Series via Diffusion Models [4.240899165468488]
Electronic Health Records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis.
However, privacy concerns often restrict access to EHRs, hindering downstream analysis.
This study aims to overcome these challenges by generating realistic and privacy-preserving synthetic EHR time series efficiently.
arXiv Detail & Related papers (2023-10-23T18:56:01Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose a Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Data-pooling Reinforcement Learning for Personalized Healthcare
Intervention [20.436521180168455]
We develop a novel data-pooling reinforcement learning (RL) algorithm based on a general perturbed value iteration framework.
Our algorithm adaptively pools historical data, with three main innovations: (i) the weight of pooling ties directly to the performance of decision (measured by regret) as opposed to estimation accuracy in conventional methods.
We substantiate the theoretical development with empirically better performance of our algorithm via a case study in the context of post-discharge intervention to prevent unplanned readmissions.
arXiv Detail & Related papers (2022-11-16T15:52:49Z) - Collaborative causal inference on distributed data [7.293479909193382]
We propose a data collaboration quasi-experiment (DC-QE) that resolves the lack of both subjects and covariates, reducing random errors and biases in the estimation.
Our method involves constructing dimensionality-reduced intermediate representations from private data from local parties, sharing intermediate representations instead of private data for privacy preservation, estimating propensity scores from the shared intermediate representations, and finally, estimating the treatment effects from propensity scores.
arXiv Detail & Related papers (2022-08-16T18:28:56Z) - Towards Handling Uncertainty-at-Source in AI -- A Review and Next Steps
for Interval Regression [6.166295570030645]
This paper focuses on linear regression for interval-valued data as a recent growth area.
We conduct an in-depth analysis of state-of-the-art methods, elucidating their behaviour, advantages, and pitfalls when applied to datasets with different properties.
arXiv Detail & Related papers (2021-04-15T05:31:10Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.