Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?
- URL: http://arxiv.org/abs/2104.07762v1
- Date: Thu, 15 Apr 2021 20:40:05 GMT
- Title: Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?
- Authors: Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, Byron C. Wallace
- Abstract summary: We design a battery of approaches intended to recover Personal Health Information from a trained BERT.
Specifically, we attempt to recover patient names and conditions with which they are associated.
We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR.
- Score: 70.3631443249802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Transformers pretrained over clinical notes from Electronic Health
Records (EHR) have afforded substantial gains in performance on predictive
clinical tasks. The cost of training such models (and the necessity of data
access to do so) coupled with their utility motivates parameter sharing, i.e.,
the release of pretrained models such as ClinicalBERT. While most efforts have
used deidentified EHR, many researchers have access to large sets of sensitive,
non-deidentified EHR with which they might train a BERT model (or similar).
Would it be safe to release the weights of such a model if they did? In this
work, we design a battery of approaches intended to recover Personal Health
Information (PHI) from a trained BERT. Specifically, we attempt to recover
patient names and conditions with which they are associated. We find that
simple probing methods are not able to meaningfully extract sensitive
information from BERT trained over the MIMIC-III corpus of EHR. However, more
sophisticated "attacks" may succeed in doing so: To facilitate such research,
we make our experimental setup and baseline probing models available at
https://github.com/elehman16/exposing_patient_data_release
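As a concrete illustration of the simplest class of probes considered here, the sketch below issues a masked prompt to a publicly released clinical BERT and inspects the top completions. This is a minimal sketch, not the paper's setup: the emilyalsentzer/Bio_ClinicalBERT checkpoint and the prompt template are illustrative assumptions, and the authors' actual probes are in the repository linked above.

```python
# Minimal fill-mask probe against a clinical BERT (a sketch, not the paper's
# exact method). Assumptions: emilyalsentzer/Bio_ClinicalBERT stands in for a
# model pretrained on clinical notes, and the prompt template is ours.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")
mask = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style models

# Probe whether the model links a (hypothetical) patient name to a condition.
prompt = f"Mr. Smith is a patient with {mask}."
for candidate in fill_mask(prompt, top_k=5):
    # Each candidate is a dict carrying the filled token and its probability.
    print(f"{candidate['token_str']:>15}  p={candidate['score']:.4f}")
```

If such a probe surfaces only generic, high-frequency conditions regardless of the name used, that is consistent with the paper's finding that simple probing does not meaningfully extract PHI.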
Related papers
- Chatting Up Attachment: Using LLMs to Predict Adult Bonds [0.0]
We use GPT-4 and Claude 3 Opus to create agents that simulate adults with varying profiles, childhood memories, and attachment styles.
We evaluate our models using a transcript dataset from 9 humans who underwent the same interview protocol, analyzed and labeled by mental health professionals.
Our findings indicate that training the models using only synthetic data achieves performance comparable to training the models on human data.
arXiv Detail & Related papers (2024-08-31T04:29:19Z)
- BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning [71.60858267608306]
Medical foundation models are susceptible to backdoor attacks.
This work introduces a method to embed a backdoor into the medical foundation model during the prompt learning phase.
Our method, BAPLe, requires only a minimal subset of data to adjust the noise trigger and the text prompts for downstream tasks.
arXiv Detail & Related papers (2024-08-14T10:18:42Z)
- Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks [0.7071166713283337]
We created datasets large enough to train machine learning models.
Our goal is to label behaviors corresponding to autism criteria.
Augmenting data increased recall by 13% but decreased precision by 16%.
arXiv Detail & Related papers (2024-05-08T03:18:12Z)
- EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models [6.506937003687058]
We publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine.
Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients.
Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models with respect to sample efficiency and task adaptation.
arXiv Detail & Related papers (2023-07-05T05:24:59Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that infers membership by targeting local overfitting of the generative model (a minimal sketch of the density-ratio idea follows this list).
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records.
We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data.
We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
- Pre-training transformer-based framework on large-scale pediatric claims data for downstream population-specific tasks [3.1580072841682734]
This study presents the Claim Pre-Training (Claim-PT) framework, a generic pre-training model that first trains on the entire pediatric claims dataset.
The effective knowledge transfer is completed through the task-aware fine-tuning stage.
We conducted experiments on a real-world claims dataset with more than one million patient records.
arXiv Detail & Related papers (2021-06-24T15:25:41Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders [34.22731849545798]
We propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters and encounter features.
We illustrate that EVA can produce realistic sequences, account for individual differences among patients, and can be conditioned on specific disease conditions.
We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients.
arXiv Detail & Related papers (2020-12-18T02:37:49Z)
- DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction [67.91606509226132]
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment.
DeepEnroll is a cross-modal inference learning model that jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference.
arXiv Detail & Related papers (2020-01-22T17:51:25Z)
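As referenced in the DOMIAS entry above, here is a minimal sketch of the density-ratio idea behind density-based membership inference. Everything in it is a simplifying assumption rather than DOMIAS itself: kernel density estimators stand in for the paper's density models, and toy Gaussian data stands in for the synthetic and reference records.

```python
# Hedged sketch of density-based membership inference: a query record is a
# likely training member when the synthetic data assigns it noticeably more
# density than a reference sample from the population does (local overfitting).
import numpy as np
from sklearn.neighbors import KernelDensity

def membership_scores(synthetic, reference, queries, bandwidth=0.5):
    """Return log p_synthetic(x) - log p_reference(x) for each query x."""
    kde_synth = KernelDensity(bandwidth=bandwidth).fit(synthetic)
    kde_ref = KernelDensity(bandwidth=bandwidth).fit(reference)
    return kde_synth.score_samples(queries) - kde_ref.score_samples(queries)

# Toy stand-ins: generator output, a population sample, and records to test.
rng = np.random.default_rng(0)
synthetic = rng.normal(0.0, 1.0, size=(500, 2))
reference = rng.normal(0.0, 1.2, size=(500, 2))
queries = rng.normal(0.0, 1.0, size=(10, 2))
print(membership_scores(synthetic, reference, queries))  # higher => more likely member
```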
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.