Distributed Multivariate Regression Modeling For Selecting Biomarkers
Under Data Protection Constraints
- URL: http://arxiv.org/abs/1803.00422v3
- Date: Sun, 1 Oct 2023 09:41:33 GMT
- Title: Distributed Multivariate Regression Modeling For Selecting Biomarkers
Under Data Protection Constraints
- Authors: Daniela Zöller and Harald Binder
- Abstract summary: We propose a multivariable regression approach for identifying biomarkers by automatic variable selection based on aggregated data in iterative calls.
The approach can be used to jointly analyze data distributed across several locations.
In a simulation, the information loss introduced by local standardization is seen to be minimal.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The discovery of clinical biomarkers requires large patient cohorts and is
aided by a pooled data approach across institutions. In many countries, data
protection constraints, especially in the clinical environment, forbid the
exchange of individual-level data between different research institutes,
impeding the conduct of joint analyses. To circumvent this problem, only
non-disclosive aggregated data is exchanged, which is often done manually and
requires explicit permission before transfer, i.e., the number of data calls
and the amount of data should be limited. This does not allow for more complex
tasks such as variable selection, as only simple aggregated summary statistics
are typically transferred. Other methods have been proposed that require more
complex aggregated data or use input data perturbation, but these methods
either cannot deal with a large number of biomarkers or lose information. Here, we
propose a multivariable regression approach for identifying biomarkers by
automatic variable selection based on aggregated data in iterative calls, which
can be implemented under data protection constraints. The approach can be used
to jointly analyze data distributed across several locations. To minimize the
amount of transferred data and the number of calls, we also provide a heuristic
variant of the approach. When performing global data standardization, the
proposed method yields the same results as pooled individual-level data
analysis. In a simulation study, the information loss introduced by local
standardization is seen to be minimal. In a typical scenario, the heuristic
decreases the number of data calls from more than 10 to 3, rendering manual
data releases feasible. To make our approach widely available for application,
we provide an implementation of the heuristic version incorporated in the
DataSHIELD framework.
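The claim that global standardization reproduces the pooled individual-level analysis can be illustrated with a minimal Python sketch. This is not the authors' algorithm or the DataSHIELD implementation: it uses a generic componentwise forward stagewise regression as a stand-in for the variable selection step, and it releases the full cross-product matrix in a single call, so it does not reflect the iterative calls or the heuristic that limit the amount of transferred data. The helper names (`site_summaries`, `pool`, `stagewise_select`), the simulated data, and the tuning values are all assumptions made for illustration.

```python
import numpy as np

def site_summaries(X, y):
    """Aggregates a site could release instead of individual-level rows (hypothetical helper)."""
    return {
        "n": X.shape[0],
        "sum_x": X.sum(axis=0),
        "sum_y": y.sum(),
        "xtx": X.T @ X,
        "xty": X.T @ y,
    }

def pool(summaries):
    """Add up per-site aggregates and derive globally standardized moments."""
    n = sum(s["n"] for s in summaries)
    mean_x = sum(s["sum_x"] for s in summaries) / n
    mean_y = sum(s["sum_y"] for s in summaries) / n
    cov_xx = sum(s["xtx"] for s in summaries) / n - np.outer(mean_x, mean_x)
    cov_xy = sum(s["xty"] for s in summaries) / n - mean_x * mean_y
    sd = np.sqrt(np.diag(cov_xx))           # global (pooled) standard deviations
    return cov_xx / np.outer(sd, sd), cov_xy / sd

def stagewise_select(corr_xx, corr_xy, steps=50, nu=0.1):
    """Componentwise forward stagewise regression that only needs pooled moments."""
    beta = np.zeros(len(corr_xy))
    for _ in range(steps):
        score = corr_xy - corr_xx @ beta    # covariance of each covariate with the residual
        j = np.argmax(np.abs(score))        # covariate that best explains the residual
        beta[j] += nu * score[j]            # small step toward its least-squares fit
    return beta

# Simulated data for two sites; covariates 0 and 3 carry the signal (assumption).
rng = np.random.default_rng(0)

def make_site(n, shift):
    X = rng.normal(loc=shift, size=(n, 10))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)
    return X, y

X1, y1 = make_site(120, 0.0)
X2, y2 = make_site(80, 0.5)

# Distributed analysis: only aggregates leave each site.
beta_distributed = stagewise_select(*pool([site_summaries(X1, y1),
                                           site_summaries(X2, y2)]))

# Reference analysis on pooled individual-level data with global standardization.
X_all = np.vstack([X1, X2])
y_all = np.concatenate([y1, y2])
Z = (X_all - X_all.mean(axis=0)) / X_all.std(axis=0)
y_centered = y_all - y_all.mean()
beta_pooled = stagewise_select(Z.T @ Z / len(y_all), Z.T @ y_centered / len(y_all))
```

Because the pooled means, variances, and cross-products are exact sums of the per-site aggregates, `beta_distributed` and `beta_pooled` agree up to floating-point error. Standardizing within each site instead would change the moments slightly, which is the information loss the simulation study in the abstract refers to.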
Related papers
- Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
- Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models [25.022166664832596]
We propose a simple and effective approach to detect data contamination in large language models (LLMs) and estimate the amount of it.
We frame data contamination detection as a series of multiple-choice questions and devise a quiz format wherein three perturbed versions of each subsampled instance from a specific dataset partition are created.
Our findings suggest that DCQ achieves state-of-the-art results and uncovers greater contamination/memorization levels compared to existing methods.
arXiv Detail & Related papers (2023-11-10T18:48:58Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Distributed sequential federated learning [0.0]
We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data.
We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico.
arXiv Detail & Related papers (2023-01-31T21:20:45Z)
- GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources [21.32471030724983]
Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems.
In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data.
arXiv Detail & Related papers (2022-12-08T01:22:12Z)
- CEDAR: Communication Efficient Distributed Analysis for Regressions [9.50726756006467]
There are growing interests about distributed learning over multiple EHRs databases without sharing patient-level data.
We propose a novel communication efficient method that aggregates the local optimal estimates, by turning the problem into a missing data problem.
We provide theoretical investigation for the properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses.
arXiv Detail & Related papers (2022-07-01T09:53:44Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Flexible variable selection in the presence of missing data [0.0]
We propose a non-parametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data.
We show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance.
arXiv Detail & Related papers (2022-02-25T21:41:03Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.