DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization
- URL: http://arxiv.org/abs/2410.17035v1
- Date: Tue, 22 Oct 2024 14:06:31 GMT
- Title: DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization
- Authors: John X. Morris, Thomas R. Campion, Sri Laasya Nutheti, Yifan Peng, Akhil Raj, Ramin Zabih, Curtis L. Cole,
- Abstract summary: We develop an adversarial approach using a large language model to re-identify the patient corresponding to a redacted clinical note.
Our method uses a large language model to reidentify the patient corresponding to a redacted clinical note.
Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
- Score: 13.038800602897354
- License:
- Abstract: Sharing protected health information (PHI) is critical for furthering biomedical research. Before data can be distributed, practitioners often perform deidentification to remove any PHI contained in the text. Contemporary deidentification methods are evaluated on highly saturated datasets (tools achieve near-perfect accuracy) which may not reflect the full variability or complexity of real-world clinical text and annotating them is resource intensive, which is a barrier to real-world applications. To address this gap, we developed an adversarial approach using a large language model (LLM) to re-identify the patient corresponding to a redacted clinical note and evaluated the performance with a novel De-Identification/Re-Identification (DIRI) method. Our method uses a large language model to reidentify the patient corresponding to a redacted clinical note. We demonstrate our method on medical data from Weill Cornell Medicine anonymized with three deidentification tools: rule-based Philter and two deep-learning-based models, BiLSTM-CRF and ClinicalBERT. Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes Our study highlights significant weaknesses in current deidentification technologies while providing a tool for iterative development and improvement.
Related papers
- Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy [63.39037092484374]
This study focuses on the clinical evaluation of medical Synthetic Data Generation using Artificial Intelligence (AI) models.
The paper contributes by a) presenting a protocol for the systematic evaluation of synthetic images by medical experts and b) applying it to assess TIDE-II, a novel variational autoencoder-based model for high-resolution WCE image synthesis.
The results show that TIDE-II generates clinically relevant WCE images, helping to address data scarcity and enhance diagnostic tools.
arXiv Detail & Related papers (2024-10-31T19:48:50Z) - DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data [6.473402241020136]
This work enhances the MASK framework by integrating ClinicalBERT, a deep learning model specifically fine-tuned on clinical texts.
The system effectively identifies and either redacts or replaces sensitive identifiable entities within clinical documents.
A risk assessment feature has also been developed, which analyses the uniqueness of context within documents to classify them into risk levels.
arXiv Detail & Related papers (2024-10-02T15:16:02Z) - Improving Clinical Note Generation from Complex Doctor-Patient Conversation [20.2157016701399]
We present three key contributions to the field of clinical note generation using large language models (LLMs)
First, we introduce CliniKnote, a dataset consisting of 1,200 complex doctor-patient conversations paired with their full clinical notes.
Second, we propose K-SOAP, which enhances traditional SOAPcitepodder20soap (Subjective, Objective, Assessment, and Plan) notes by adding a keyword section at the top, allowing for quick identification of essential information.
Third, we develop an automatic pipeline to generate K-SOAP notes from doctor-patient conversations and benchmark various modern LLMs using various
arXiv Detail & Related papers (2024-08-26T18:39:31Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - sEHR-CE: Language modelling of structured EHR data for efficient and
generalizable patient cohort expansion [0.0]
sEHR-CE is a novel framework based on transformers to enable integrated phenotyping and analyses of heterogeneous clinical datasets.
We validate our approach using primary and secondary care data from the UK Biobank, a large-scale research study.
arXiv Detail & Related papers (2022-11-30T16:00:43Z) - Classifying Cyber-Risky Clinical Notes by Employing Natural Language
Processing [9.77063694539068]
Recently, some states within the United States of America require patients to have open access to their clinical notes.
This research investigates methods for identifying security/privacy risks within clinical notes.
arXiv Detail & Related papers (2022-03-24T00:36:59Z) - Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text based on word simplification and language modelling.
We use a new dataset pairs of publicly available medical sentences and a version of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.