DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4
- URL: http://arxiv.org/abs/2303.11032v2
- Date: Thu, 21 Dec 2023 16:13:05 GMT
- Title: DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4
- Authors: Zhengliang Liu, Yue Huang, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao,
Haixing Dai, Lin Zhao, Yiwei Li, Peng Shu, Fang Zeng, Lichao Sun, Wei Liu,
Dinggang Shen, Quanzheng Li, Tianming Liu, Dajiang Zhu, Xiang Li
- Abstract summary: We develop a novel GPT4-enabled de-identification framework ("DeID-GPT").
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
- Score: 80.36535668574804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The digitization of healthcare has facilitated the sharing and re-using of
medical data but has also raised concerns about confidentiality and privacy.
HIPAA (Health Insurance Portability and Accountability Act) mandates removing
re-identifying information before the dissemination of medical records. Thus,
effective and efficient solutions for de-identifying medical data, especially
those in free-text forms, are highly needed. While various computer-assisted
de-identification methods, including both rule-based and learning-based, have
been developed and used in prior practice, such solutions still lack
generalizability or need to be fine-tuned according to different scenarios,
significantly imposing restrictions in wider use. The advancement of large
language models (LLM), such as ChatGPT and GPT-4, has shown great potential in
processing text data in the medical domain with zero-shot in-context learning,
especially in the task of privacy protection, as these models can identify
confidential information by their powerful named entity recognition (NER)
capability. In this work, we developed a novel GPT4-enabled de-identification
framework ("DeID-GPT") to automatically identify and remove identifying
information. Compared to existing commonly used medical text data
de-identification methods, our developed DeID-GPT showed the highest accuracy
and remarkable reliability in masking private information from the unstructured
medical text while preserving the original structure and meaning of the text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text
data processing and de-identification, which provides insights for further
research and solution development on the use of LLMs such as ChatGPT/GPT-4 in
healthcare. Codes and benchmarking data information are available at
https://github.com/yhydhx/ChatGPT-API.
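The abstract contrasts zero-shot LLM de-identification with the rule-based methods used in prior practice. As a minimal sketch of both ideas, the following shows a hypothetical zero-shot instruction in the spirit of DeID-GPT (the paper's actual prompt wording is not given here, so this text is an assumption) alongside a tiny rule-based baseline of the kind such frameworks are compared against:

```python
import re

# Hypothetical zero-shot de-identification instruction; the exact prompt
# used by DeID-GPT is not reproduced here, so this wording is an assumption.
DEID_PROMPT = (
    "You are a HIPAA de-identification assistant. Replace every piece of "
    "protected health information (names, dates, phone numbers, emails, "
    "record IDs, locations) in the following clinical note with a mask tag, "
    "preserving the rest of the text verbatim:\n\n{note}"
)

# A tiny rule-based baseline: regex patterns for a few common PHI types.
# Real rule-based systems add NER for names, locations, and institutions.
_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def rule_based_deidentify(note: str) -> str:
    """Mask simple PHI patterns, replacing each match with its type tag."""
    for tag, pattern in _PATTERNS.items():
        note = pattern.sub(f"[{tag}]", note)
    return note

masked = rule_based_deidentify(
    "Patient seen on 03/21/2023. Contact 555-123-4567 or jdoe@example.com."
)
```

The gap the paper targets is visible even in this sketch: the regex baseline misses anything outside its fixed patterns (names, addresses, misspelled dates), whereas a zero-shot LLM prompt relies on the model's NER capability to generalize without per-scenario rules or fine-tuning.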
Related papers
- FEDMEKI: A Benchmark for Scaling Medical Foundation Models via Federated Knowledge Injection [83.54960238236548]
FEDMEKI not only preserves data privacy but also enhances the capability of medical foundation models.
FEDMEKI allows medical foundation models to learn from a broader spectrum of medical knowledge without direct data exposure.
arXiv Detail & Related papers (2024-08-17T15:18:56Z)
- Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks [7.928574214440075]
This study examines integrating EHRs and NLP with large language models (LLMs) to improve healthcare data management and patient care.
It focuses on using advanced models to create secure, HIPAA-compliant synthetic patient notes for biomedical research.
arXiv Detail & Related papers (2024-07-23T04:20:14Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- An Easy-to-use and Robust Approach for the Differentially Private De-Identification of Clinical Textual Documents [0.0]
This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening less robust de-identification methods.
The result is an approach for de-identifying clinical documents in French language, but also generalizable to other languages.
arXiv Detail & Related papers (2022-11-02T14:25:09Z)
- De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks [0.0]
We propose a new comprehensive de-identification method dedicated to French-language medical documents.
The approach has been evaluated on a French language medical dataset of a French public hospital.
arXiv Detail & Related papers (2022-09-16T13:00:47Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
- MASK: A flexible framework to facilitate de-identification of clinical texts [2.3015324171336378]
We present MASK, a software package that is designed to perform the de-identification task.
The software is able to perform named entity recognition using some of the state-of-the-art techniques and then mask or redact recognized entities.
arXiv Detail & Related papers (2020-05-24T08:53:00Z)
- Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records [4.339510167603376]
We construct a varied dataset consisting of the medical records of 1260 patients by sampling data from nine institutes and three domains of Dutch healthcare.
We test the generalizability of three de-identification methods across languages and domains.
arXiv Detail & Related papers (2020-01-16T09:42:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.