Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling
- URL: http://arxiv.org/abs/2409.09831v2
- Date: Tue, 17 Sep 2024 11:18:37 GMT
- Title: Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling
- Authors: Samuel Belkadi, Libo Ren, Nicolo Micheletti, Lifeng Han, Goran Nenadic
- Abstract summary: We present a system that generates synthetic free-text medical records using Masked Language Modeling (MLM).
Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk.
- Score: 6.193782515824411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a system that generates synthetic free-text medical records, such as discharge summaries, admission notes and doctor correspondences, using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk. The system incorporates a de-identification component that uses Philter to mask Protected Health Information (PHI), followed by a Medical Entity Recognition (NER) model to retain key medical information. We explore various masking ratios and mask-filling techniques to balance the trade-off between diversity and fidelity in the synthetic outputs without affecting overall readability. Our results demonstrate that the system can produce high-quality synthetic data with significant diversity while achieving a HIPAA-compliant PHI recall rate of 0.96 and a low re-identification risk of 0.035. Furthermore, downstream evaluations using a NER task reveal that the synthetic data can be effectively used to train models with performance comparable to those trained on real data. The flexibility of the system allows it to be adapted for specific use cases, making it a valuable tool for privacy-preserving data generation in medical research and healthcare applications.
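The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_tokens`, the whitespace tokenisation, the `[MASK]` string, and the hand-picked entity positions are all assumptions made for illustration. It assumes PHI has already been removed upstream (for example by Philter) and that an NER model has supplied the indices of medical-entity tokens to keep. Only a chosen ratio of the remaining tokens is masked, and a masked language model would then fill those positions.

```python
import random

MASK = "[MASK]"  # assumed mask string; the real one depends on the MLM tokenizer


def mask_tokens(tokens, protected, ratio, seed=0):
    """Replace a fraction `ratio` of the non-protected tokens with MASK.

    `protected` is a set of token indices that must be kept unchanged,
    e.g. positions an NER model tagged as medical entities (hypothetical input).
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Only tokens outside the protected set are eligible for masking.
    candidates = [i for i in range(len(tokens)) if i not in protected]
    n_mask = int(ratio * len(candidates))
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    chosen = set(rng.sample(candidates, n_mask))
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


# Made-up sentence, not taken from any real record.
tokens = "Patient was given 40 mg of furosemide for pulmonary edema".split()
protected = {3, 4, 6, 8, 9}  # hypothetical entity positions: dose, drug, condition
masked = mask_tokens(tokens, protected, ratio=0.4)
print(" ".join(masked))
```

Raising `ratio` masks more of the non-entity context, which would give the mask-filling model more room to vary the text (more diversity) at the cost of staying less close to the original (less fidelity), which is the trade-off the abstract says the authors explore.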
Related papers
- Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries [51.72836644350993]
Multimodal Pretraining DEL-Fusion model (MPDF)
We develop pretraining tasks applying contrastive objectives between different compound representations and their text descriptions.
We propose a novel DEL-fusion framework that amalgamates compound information at the atomic, submolecular, and molecular levels.
arXiv Detail & Related papers (2024-09-07T17:32:21Z)
- Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks [7.928574214440075]
This study examines integrating EHRs and NLP with large language models (LLMs) to improve healthcare data management and patient care.
It focuses on using advanced models to create secure, HIPAA-compliant synthetic patient notes for biomedical research.
arXiv Detail & Related papers (2024-07-23T04:20:14Z)
- Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks [5.0243930429558885]
This paper introduces Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers.
At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information.
The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases.
arXiv Detail & Related papers (2024-07-22T10:31:07Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images.
Our approach fuses image and textual data to enhance the generation process.
We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z)
- Medical Scientific Table-to-Text Generation with Human-in-the-Loop under the Data Sparsity Constraint [11.720364723821993]
An efficient table-to-text summarization system can drastically reduce manual efforts to condense this data into reports.
However, in practice, the problem is heavily impeded by the data paucity, data sparsity and inability of the state-of-the-art natural language generation models to produce accurate and reliable outputs.
We propose a novel table-to-text approach and tackle these problems with a novel two-step architecture which is enhanced by auto-correction, copy mechanism and synthetic data augmentation.
arXiv Detail & Related papers (2022-05-24T21:10:57Z)
- G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
- Fidelity and Privacy of Synthetic Medical Data [0.0]
The digitization of medical records ushered in a new era of big data to clinical science.
The need to share individual-level medical data continues to grow, and has never been more urgent.
Enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy.
arXiv Detail & Related papers (2021-01-18T23:01:27Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
- Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)