Distributed learning optimisation of Cox models can leak patient data: Risks and solutions
- URL: http://arxiv.org/abs/2204.05856v1
- Date: Tue, 12 Apr 2022 14:56:20 GMT
- Title: Distributed learning optimisation of Cox models can leak patient data: Risks and solutions
- Authors: Carsten Brink (1,2) and Christian Rønn Hansen (1,2) and Matthew
Field (3,4) and Gareth Price (5) and David Thwaites (6) and Nis Sarup (1) and
Uffe Bernchou (1,2) and Lois Holloway (3,4,6,7) ((1) Laboratory of Radiation
Physics, Department of Oncology, Odense University Hospital, Odense, Denmark,
(2) Department of Clinical Research, University of Southern Denmark, Odense,
Denmark, (3) South Western Sydney Clinical School, Faculty of Medicine, UNSW,
Sydney, New South Wales, Australia, (4) Ingham Institute for Applied Medical
Research, Liverpool, New South Wales, Australia, (5) The University of
Manchester, Manchester Academic Health Science Centre, The Christie NHS
Foundation Trust, Manchester, UK, (6) Institute of Medical Physics, School of
Physics, University of Sydney, Sydney, New South Wales, Australia, (7)
Liverpool and Macarthur Cancer Therapy Centres, Liverpool, New South Wales,
Australia)
- Abstract summary: This paper demonstrates that the optimisation of a Cox survival model can lead to patient data leakage.
We suggest a way to optimise and validate a Cox model that avoids these problems in a secure way.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Medical data are often highly sensitive, and missing data are common. Because of this sensitivity, there is interest in modelling methods where the data are kept at each local centre to preserve privacy, while the model is still trained on, and learns from, data across multiple centres. One such approach is distributed machine learning (federated learning, collaborative learning), in which a model is iteratively computed from aggregated local model information from each centre. However, even though no individual records leave a centre, there is a potential risk that the exchanged information is sufficient to reconstruct all or part of the patient data, which would undermine the privacy-protecting rationale of distributed learning. This paper demonstrates that the optimisation of a Cox survival model can lead to patient data leakage. Following this, we suggest a way to optimise and validate a Cox model that avoids these problems in a secure way. The feasibility of the suggested method is demonstrated in provided Matlab code that also includes methods for handling missing data.
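To make the leakage mechanism concrete, here is a minimal Python sketch of the risk-set aggregates a centre might exchange in a naive federated Cox optimisation (the paper itself provides Matlab code, which is not reproduced here; the function name, toy data, and the differencing attack below are illustrative assumptions, not the authors' exact demonstration):

```python
import numpy as np

def local_risk_set_aggregates(X, time, beta, event_times):
    """Aggregates a centre might share per optimisation step in a naive
    federated Cox fit: S0 = sum(exp(x.beta)) and S1 = sum(x * exp(x.beta))
    over the patients still at risk at each event time."""
    S0, S1 = [], []
    for t in event_times:
        at_risk = time >= t
        w = np.exp(X[at_risk] @ beta)
        S0.append(w.sum())
        S1.append(w @ X[at_risk])
    return np.array(S0), np.vstack(S1)

# Toy centre: 3 patients, 2 covariates, distinct event times.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
time = np.array([1.0, 2.0, 3.0])
beta = np.zeros(2)  # typical starting point of the optimisation

S0, S1 = local_risk_set_aggregates(X, time, beta, time)

# Leakage: at beta = 0 every weight is 1, so S1[k] is the plain sum of
# covariates of patients still at risk at the k-th event time. Differencing
# consecutive risk sets recovers the exact covariate vector of the single
# patient who left the risk set between the two event times.
print(np.allclose(S1[0] - S1[1], X[0]))  # True
```

Singleton risk sets behave even worse: S1/S0 then equals one patient's covariate vector directly. This is exactly the kind of exchanged information "sufficient to reconstruct all or part of the patient data" that the abstract warns about.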
Related papers
- Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators.
Our method evaluates the relevance of a data generator by measuring the effect of a gradient step using its local dataset.
We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator; a generic sketch of this relevance score follows this entry.
arXiv Detail & Related papers (2024-09-03T17:12:21Z)
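As a generic illustration of the relevance score described in the entry above, the sketch below rates a data generator by the loss reduction that one gradient step on its local dataset produces; the MSE objective, function names, and toy generators are assumptions for illustration, not the cited paper's code.

```python
import numpy as np

def loss_fn(w, data):
    X, y = data
    return np.mean((X @ w - y) ** 2)  # local mean squared error

def grad_fn(w, data):
    X, y = data
    return 2.0 * X.T @ (X @ w - y) / len(y)

def relevance(w, data, lr=0.1):
    """Loss reduction achieved by one gradient step on the generator's
    local dataset: larger means the generator is more relevant."""
    return loss_fn(w, data) - loss_fn(w - lr * grad_fn(w, data), data)

rng = np.random.default_rng(1)
w = np.zeros(3)
X = rng.normal(size=(20, 3))
gen_structured = (X, X @ np.array([1.0, -2.0, 0.5]))  # learnable signal
gen_noise = (X, rng.normal(size=20))                  # pure noise labels
print(relevance(w, gen_structured) > relevance(w, gen_noise))  # True
```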
- Data Encoding For Healthcare Data Democratisation and Information Leakage Prevention [23.673071967945358]
This paper argues that irreversible data encoding can provide an effective solution to achieve data democratization.
It exploits random projections and random quantum encoding to realize this framework for dense and longitudinal or time-series data.
Experimental evaluation highlights that models trained on encoded time-series data effectively uphold the information bottleneck principle.
arXiv Detail & Related papers (2023-05-05T17:50:50Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Decentralized Distributed Learning with Privacy-Preserving Data Synthesis [9.276097219140073]
In the medical field, multi-center collaborations are often sought to yield more generalizable findings by leveraging the heterogeneity of patient and clinical data.
Recent privacy regulations hinder the possibility to share data, and consequently, to come up with machine learning-based solutions that support diagnosis and prognosis.
We present a decentralized distributed method that integrates features from local nodes, providing models able to generalize across multiple datasets while maintaining privacy.
arXiv Detail & Related papers (2022-06-20T23:49:38Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Decentralized federated learning of deep neural networks on non-iid data [0.6335848702857039]
We tackle the problem of learning a personalized deep learning model in a decentralized, non-IID setting.
We propose a method named Performance-Based Neighbor Selection (PENS) where clients with similar data detect each other and cooperate.
PENS achieves higher accuracies than strong baselines.
arXiv Detail & Related papers (2021-07-18T19:05:44Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Federated Survival Analysis with Discrete-Time Cox Models [0.46331617589391827]
We build machine learning models from decentralized datasets located in different centers with federated learning (FL).
We show that the resulting model may suffer from important performance loss in some adverse settings.
Using this approach, we train survival models using standard FL techniques on synthetic data, as well as real-world datasets from The Cancer Genome Atlas (TCGA); a generic sketch of the discrete-time formulation follows this entry.
arXiv Detail & Related papers (2020-06-16T08:53:19Z)
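As a generic sketch of the discrete-time Cox formulation named in the entry above (not the cited paper's code; the expansion helper and toy data are assumptions), each subject is expanded into person-period rows so that survival fitting reduces to an ordinary, and hence federatable, logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def person_period_expand(X, time, event, n_intervals):
    """One row per discrete interval a subject was at risk; the binary
    label marks the interval in which the event occurred (0 if censored)."""
    feats, labels = [], []
    eye = np.eye(n_intervals)  # interval dummies play the baseline-hazard role
    for x, t, e in zip(X, time, event):
        for k in range(t):
            feats.append(np.concatenate([eye[k], x]))
            labels.append(1 if (e == 1 and k == t - 1) else 0)
    return np.array(feats), np.array(labels)

# Toy data: 4 subjects, 1 covariate, discrete follow-up of 1-3 intervals.
X = np.array([[0.5], [-1.0], [0.2], [1.5]])
time = np.array([2, 3, 1, 2])   # intervals observed
event = np.array([1, 0, 1, 1])  # 1 = event in last interval, 0 = censored

F, y = person_period_expand(X, time, event, n_intervals=3)

# Each centre can expand locally and then take part in any standard
# federated logistic-regression scheme to estimate the discrete hazard.
print(LogisticRegression().fit(F, y).coef_)
```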
- Have you forgotten? A method to assess if machine learning models have forgotten data [20.9131206112401]
In the era of deep learning, aggregation of data from several sources is a common approach to ensuring data diversity.
In this paper, we want to address the challenging question of whether data have been forgotten by a model.
We establish statistical methods that compare the target's outputs with outputs of models trained with different datasets.
arXiv Detail & Related papers (2020-04-21T16:13:45Z)