Related papers: Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?

Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?

URL: http://arxiv.org/abs/2104.07762v1
Date: Thu, 15 Apr 2021 20:40:05 GMT
Title: Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?
Authors: Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, Byron C. Wallace
Abstract summary: We design a battery of approaches intended to recover Personal Health Information from a trained BERT. Specifically, we attempt to recover patient names and conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR.
Score: 70.3631443249802
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated "attacks" may succeed in doing so: To facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release

Related papers

DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data [2.9929405444223205]
EHRs are pivotal in clinical practices, yet their retrieval remains a challenge mainly due to semantic gap issues.<n>Recent advancements in dense retrieval offer promising solutions but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora.<n>This paper introduces textttDR.EHR, a series of dense retrieval models specifically tailored for EHR retrieval.
arXiv Detail & Related papers (2025-07-24T17:02:46Z)
Ensemble BERT for Medication Event Classification on Electronic Health Records (EHRs) [1.5211498825393837]
n2c2 2022 provided shared tasks on challenges in natural language processing for clinical data analytics on electronic health records (EHR)<n>This study focuses on subtask 2 in Track 1 of this challenge that is to detect and classify medication events from clinical notes through building a novel BERT-based ensemble model.<n>It started with pretraining BERT models on different types of big data such as Wikipedia and MIMIC. Afterwards, these pretrained BERT models were fine-tuned on CMED training data.<n> Experimental results demonstrated that BERT-based ensemble models can effectively improve strict Micro-F score by about 5% and
arXiv Detail & Related papers (2025-06-29T16:17:17Z)
Chatting Up Attachment: Using LLMs to Predict Adult Bonds [0.0]
We use GPT-4 and Claude 3 Opus to create agents that simulate adults with varying profiles, childhood memories, and attachment styles. We evaluate our models using a transcript dataset from 9 humans who underwent the same interview protocol, analyzed and labeled by mental health professionals. Our findings indicate that training the models using only synthetic data achieves performance comparable to training the models on human data.
arXiv Detail & Related papers (2024-08-31T04:29:19Z)
BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning [71.60858267608306]
Medical foundation models are susceptible to backdoor attacks. This work introduces a method to embed a backdoor into the medical foundation model during the prompt learning phase. Our method, BAPLe, requires only a minimal subset of data to adjust the noise trigger and the text prompts for downstream tasks.
arXiv Detail & Related papers (2024-08-14T10:18:42Z)
Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks [0.7071166713283337]
We created datasets large enough to train machine learning models. Our goal is to label behaviors corresponding to autism criteria. Augmenting data increased recall by 13% but decreased precision by 16%.
arXiv Detail & Related papers (2024-05-08T03:18:12Z)
EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models [6.506937003687058]
We publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation.
arXiv Detail & Related papers (2023-07-05T05:24:59Z)
Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records. We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data. We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
Pre-training transformer-based framework on large-scale pediatric claims data for downstream population-specific tasks [3.1580072841682734]
This study presents the Claim Pre-Training (Claim-PT) framework, a generic pre-training model that first trains on the entire pediatric claims dataset. The effective knowledge transfer is completed through the task-aware fine-tuning stage. We conducted experiments on a real-world claims dataset with more than one million patient records.
arXiv Detail & Related papers (2021-06-24T15:25:41Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders [34.22731849545798]
We propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters and encounter features. We illustrate that EVA can produce realistic sequences, account for individual differences among patients, and can be conditioned on specific disease conditions. We assess the utility of the methods on large real-world EHR repositories containing over 250, 000 patients.
arXiv Detail & Related papers (2020-12-18T02:37:49Z)
DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction [67.91606509226132]
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. DeepEnroll is a cross-modal inference learning model to jointly encode enrollment criteria (tabular data) into a shared latent space for matching inference.
arXiv Detail & Related papers (2020-01-22T17:51:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.