Term2Note: Synthesising Differentially Private Clinical Notes from Medical Terms
- URL: http://arxiv.org/abs/2509.10882v1
- Date: Sat, 13 Sep 2025 16:26:38 GMT
- Title: Term2Note: Synthesising Differentially Private Clinical Notes from Medical Terms
- Authors: Yuping Wu, Viktor Schlegel, Warren Del-Pinto, Srinivasan Nandakumar, Iqra Zahid, Yidan Sun, Usama Farghaly Omar, Amirah Jasmine, Arun-Kumar Kaliya-Perumal, Chun Shen Tham, Gabriel Connors, Anil A Bharath, Goran Nenadic
- Abstract summary: Term2Note is a methodology to synthesise long clinical notes under strong DP constraints. It produces synthetic notes with statistical properties closely aligned with real clinical notes. It achieves substantial improvements in both fidelity and utility while operating under fewer assumptions.
- Score: 22.19967672101843
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Training data is fundamental to the success of modern machine learning models, yet in high-stakes domains such as healthcare, the use of real-world training data is severely constrained by concerns over privacy leakage. A promising solution to this challenge is the use of differentially private (DP) synthetic data, which offers formal privacy guarantees while maintaining data utility. However, striking the right balance between privacy protection and utility remains challenging in clinical note synthesis, given its domain specificity and the complexity of long-form text generation. In this paper, we present Term2Note, a methodology to synthesise long clinical notes under strong DP constraints. By structurally separating content and form, Term2Note generates section-wise note content conditioned on DP medical terms, with each governed by separate DP constraints. A DP quality maximiser further enhances synthetic notes by selecting high-quality outputs. Experimental results show that Term2Note produces synthetic notes with statistical properties closely aligned with real clinical notes, demonstrating strong fidelity. In addition, multi-label classification models trained on these synthetic notes perform comparably to those trained on real data, confirming their high utility. Compared to existing DP text generation baselines, Term2Note achieves substantial improvements in both fidelity and utility while operating under fewer assumptions, suggesting its potential as a viable privacy-preserving alternative to using sensitive clinical notes.
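The abstract mentions two ideas that a small example can make concrete: separate DP constraints for separate components, and a DP step that selects high-quality outputs. The sketch below is illustrative only and is not the paper's implementation. It assumes basic sequential composition for combining budgets and uses the exponential mechanism as one standard way to select privately; the function names, candidate scores, and sensitivity value are all hypothetical.

```python
import math
import random


def total_epsilon(*budgets):
    """Basic sequential composition: the epsilons of separate DP steps add up."""
    return sum(budgets)


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Return an index sampled with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).

    Assumption: each score depends on the private data with sensitivity
    at most `sensitivity`; otherwise the DP guarantee does not hold.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = epsilon / (2.0 * sensitivity)
    # Subtracting the maximum keeps exp() from overflowing and leaves
    # the sampling probabilities unchanged.
    top = max(scores)
    weights = [math.exp(scale * (s - top)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical quality scores for three candidate synthetic notes.
    candidate_scores = [0.1, 0.9, 0.4]
    chosen = exponential_mechanism(candidate_scores, epsilon=1.0, sensitivity=1.0, rng=rng)
    print("selected candidate index:", chosen)
    # Hypothetical split: one budget for terms, one for notes, one for selection.
    print("total budget:", total_epsilon(1.0, 2.0, 1.0))
```

A smaller `epsilon` makes the selection closer to uniform (more private, less quality-driven), while a larger one concentrates probability on the best-scoring candidate.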
Related papers
- How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage. Differentially Private Synthetic data refers to synthetic data that preserves the overall trends of source data.
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
- Sensitivity, Specificity, and Consistency: A Tripartite Evaluation of Privacy Filters for Synthetic Data Generation [57.13635002340272]
Post-hoc privacy filtering techniques have been proposed to remove samples containing personally identifiable information. This work presents a rigorous evaluation of a filtering pipeline applied to chest X-ray synthesis. We conclude that substantial advances in filter design are needed before these methods can be confidently deployed in sensitive applications.
arXiv Detail & Related papers (2025-10-02T08:32:20Z)
- How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues [14.457387337806765]
Synthetic data adoption in healthcare is driven by privacy concerns, data access limitations, and high annotation costs. We explore synthetic Prolonged Exposure (PE) therapy conversations for PTSD as a scalable alternative for training clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics like turn-taking and treatment fidelity.
arXiv Detail & Related papers (2025-04-30T16:56:56Z)
- DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators [47.86275136491794]
We propose DP-2Stage, a two-stage fine-tuning framework for differentially private data generation. Our empirical results show that this approach improves performance across various settings and metrics.
arXiv Detail & Related papers (2024-12-03T14:10:09Z)
- Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks [7.928574214440075]
This study examines integrating EHRs and NLP with large language models (LLMs) to improve healthcare data management and patient care.
It focuses on using advanced models to create secure, HIPAA-compliant synthetic patient notes for biomedical research.
arXiv Detail & Related papers (2024-07-23T04:20:14Z)
- De-identification is not enough: a comparison between de-identified and synthetic clinical notes [8.506138767850773]
We show that de-identification of real clinical notes does not protect records against a membership inference attack. When synthetically generated notes closely match the performance of real data, they also exhibit similar privacy concerns to the real data.
arXiv Detail & Related papers (2024-01-31T21:14:01Z)
- Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes [11.106831545858656]
We create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature.
We then use these synthetic notes to train our specialized clinical large language model, Asclepius.
We benchmark Asclepius against several other large language models, including GPT-3.5-turbo and other open-source alternatives.
arXiv Detail & Related papers (2023-09-01T04:01:20Z)
- Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging [47.99192239793597]
We evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training.
Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
arXiv Detail & Related papers (2023-02-03T09:49:13Z)
- Effective and Privacy preserving Tabular Data Synthesizing [0.0]
We develop a novel conditional table GAN architecture that can model diverse data types with complex distributions.
We train CTAB-GAN with strict privacy guarantees to ensure greater security for training GANs against malicious privacy attacks.
arXiv Detail & Related papers (2021-08-11T13:55:48Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
- An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
- Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.