De-identification is not always enough
- URL: http://arxiv.org/abs/2402.00179v1
- Date: Wed, 31 Jan 2024 21:14:01 GMT
- Title: De-identification is not always enough
- Authors: Atiquer Rahman Sarkar, Yao-Shun Chuang, Noman Mohammed, Xiaoqian Jiang
- Abstract summary: We show that de-identification of real clinical notes does not protect records against a membership inference attack.
When synthetically generated notes closely match the performance of real data, they also exhibit privacy concerns similar to those of the real data.
- Score: 9.292345527034348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For sharing privacy-sensitive data, de-identification is commonly regarded as
adequate for safeguarding privacy. Synthetic data is also being considered as a
privacy-preserving alternative. Recent successes with numerical and tabular
data generative models and the breakthroughs in large generative language
models raise the question of whether synthetically generated clinical notes
could be a viable alternative to real notes for research purposes. In this
work, we (i) demonstrated that de-identification of real clinical notes does
not protect records against a membership inference attack, (ii) proposed a
novel approach to generating synthetic clinical notes using current
state-of-the-art large language models, (iii) evaluated the performance of the
synthetically generated notes on a clinical-domain task, and (iv) proposed a
way to mount a membership inference attack where the target model is trained
with synthetic data. We observed that when synthetically generated notes
closely match the performance of real data, they also exhibit privacy concerns
similar to those of the real data. Whether other approaches to generating
synthetic clinical notes could offer better trade-offs and become a better
alternative to sensitive real notes warrants further investigation.
Related papers
- Synthetic4Health: Generating Annotated Synthetic Clinical Letters [6.822926897514792]
Since clinical letters contain sensitive information, clinical datasets cannot be widely used in model training, medical research, or teaching.
This work aims to generate reliable, diverse, and de-identified synthetic clinical letters.
arXiv Detail & Related papers (2024-09-14T18:15:07Z)
- Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks [7.928574214440075]
This study examines integrating EHRs and NLP with large language models (LLMs) to improve healthcare data management and patient care.
It focuses on using advanced models to create secure, HIPAA-compliant synthetic patient notes for biomedical research.
arXiv Detail & Related papers (2024-07-23T04:20:14Z)
- Synthetic Data Outliers: Navigating Identity Disclosure [3.8811062755861956]
We analyze the privacy of synthetic data with respect to outliers.
Our main findings suggest that re-identification of outliers via a linkage attack is feasible and easily achieved.
Additional safeguards such as differential privacy can prevent re-identification, albeit at the expense of data utility.
arXiv Detail & Related papers (2024-06-04T19:35:44Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to comprehensively assess the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present the Discrepancy Aware Framework (DAF), which achieves consistently robust performance with simple and inexpensive strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under simple synthesis strategies, it outperforms existing methods by a large margin and also achieves state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z)
- Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes [11.106831545858656]
We create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature.
We then use these synthetic notes to train our specialized clinical large language model, Asclepius.
We benchmark Asclepius against several other large language models, including GPT-3.5-turbo and other open-source alternatives.
arXiv Detail & Related papers (2023-09-01T04:01:20Z)
- Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe [32.63295550058343]
We show that a simple and practical recipe in the text domain is effective in generating useful synthetic text with strong privacy protection.
Our method produces synthetic text that is competitive in terms of utility with its non-private counterpart.
arXiv Detail & Related papers (2022-10-25T21:21:17Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
- An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
- Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News [57.9843300852526]
We introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions.
To identify the possible weaknesses that adversaries can exploit, we create the NeuralNews dataset, composed of four different types of generated articles.
In addition to the valuable insights gleaned from our user study experiments, we provide a relatively effective approach based on detecting visual-semantic inconsistencies.
arXiv Detail & Related papers (2020-09-16T14:13:15Z)
- Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-track competition designed to accelerate progress on two coupled problems: synthetic data generation and patient re-identification.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.