How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding
- URL: http://arxiv.org/abs/2511.14936v1
- Date: Tue, 18 Nov 2025 21:51:04 GMT
- Title: How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding
- Authors: Mathieu Dufour, Andrew Duncan
- Abstract summary: Large language models trained on clinical text risk exposing sensitive patient information. Despite rapid progress in DP optimisation, it remains unclear which privacy-preserving strategy actually works best. Knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training.
- Score: 0.33148826359547523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets ($\varepsilon \in \{4, 6\}$), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63\% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC $\approx$ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.
Related papers
- PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems [0.0]
We present PrivMedChat, an end-to-end framework for differentially private RLHF.
arXiv Detail & Related papers (2026-03-03T14:53:20Z)
- Pre-training Differentially Private Models with Limited Public Data [54.943023722114134]
Differential privacy (DP) is a prominent method to gauge the degree of security provided to the models.
DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training stage.
We develop a novel DP continual pre-training strategy using only 10% of public data.
Our strategy can achieve DP accuracy of 41.5% on ImageNet-21k, as well as non-DP accuracy of 55.7% and 60.0% on downstream tasks Places365 and iNaturalist-2021.
arXiv Detail & Related papers (2024-02-28T23:26:27Z)
- Differentially Private Distributed Inference [2.4401219403555814]
Healthcare centers collaborating on clinical trials must balance knowledge sharing with safeguarding sensitive patient data. We address this challenge by using differential privacy (DP) to control information leakage. Agents update belief statistics via log-linear rules, and DP noise provides plausible deniability and rigorous performance guarantees.
arXiv Detail & Related papers (2024-02-13T01:38:01Z)
- Preserving privacy in domain transfer of medical AI models comes at no performance costs: The integral role of differential privacy [5.025818976218807]
We evaluate the efficacy of DP-enhanced domain transfer (DP-DT) in diagnosing cardiomegaly, pleural effusion, pneumonia, atelectasis, and in identifying healthy subjects.
Our results show that DP-DT, even with exceptionally high privacy levels, performs comparably to non-DP-DT.
arXiv Detail & Related papers (2023-06-10T18:41:50Z)
- Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging [47.99192239793597]
We evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training.
Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
arXiv Detail & Related papers (2023-02-03T09:49:13Z)
- TAN Without a Burn: Scaling Laws of DP-SGD [70.7364032297978]
Differentially Private methods for training Deep Neural Networks (DNNs) have progressed recently.
We decouple privacy analysis and experimental behavior of noisy training to explore the trade-off with minimal computational requirements.
We apply the proposed method on CIFAR-10 and ImageNet and, in particular, strongly improve the state-of-the-art on ImageNet with a +9 points gain in top-1 accuracy.
arXiv Detail & Related papers (2022-10-07T08:44:35Z)
- Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent [69.14164921515949]
We characterize privacy guarantees for individual examples when releasing models trained by DP-SGD.
We find that most examples enjoy stronger privacy guarantees than the worst-case bound.
This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees.
arXiv Detail & Related papers (2022-06-06T13:49:37Z)
- Large Scale Transfer Learning for Differentially Private Image Classification [51.10365553035979]
Differential Privacy (DP) provides a formal framework for training machine learning models with individual example level privacy.
Private training using DP-SGD protects against leakage by injecting noise into individual example gradients.
While this result is quite appealing, the computational cost of training large-scale models with DP-SGD is substantially higher than non-private training.
arXiv Detail & Related papers (2022-05-06T01:22:20Z)
- NeuralDP Differentially private neural networks by design [61.675604648670095]
We propose NeuralDP, a technique for privatising activations of some layer within a neural network.
We experimentally demonstrate on two datasets that our method offers substantially improved privacy-utility trade-offs compared to DP-SGD.
arXiv Detail & Related papers (2021-07-30T12:40:19Z)
- Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings [34.26542589537452]
Methods for differentially private (DP) learning provide a general-purpose approach to learn models with privacy guarantees.
Modern methods for DP learning ensure privacy through mechanisms that censor information judged as too unique.
We use state-of-the-art methods for DP learning to train privacy-preserving models in clinical prediction tasks.
arXiv Detail & Related papers (2020-10-13T19:56:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.