Benchmarking Modern Named Entity Recognition Techniques for Free-text
Health Record De-identification
- URL: http://arxiv.org/abs/2103.13546v1
- Date: Thu, 25 Mar 2021 01:26:58 GMT
- Title: Benchmarking Modern Named Entity Recognition Techniques for Free-text
Health Record De-identification
- Authors: Abdullah Ahmed, Adeel Abbasi, Carsten Eickhoff
- Abstract summary: Federal law restricts the sharing of any EHR data that contains protected health information (PHI).
This project explores several deep learning-based named entity recognition (NER) methods to determine which method(s) perform better on the de-identification task.
We trained and tested our models on the i2b2 training dataset, and qualitatively assessed their performance using EHR data collected from a local hospital.
- Score: 6.026640792312181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Electronic Health Records (EHRs) have become the primary form of medical
data-keeping across the United States. Federal law restricts the sharing of any
EHR data that contains protected health information (PHI). De-identification,
the process of identifying and removing all PHI, is crucial for making EHR data
publicly available for scientific research. This project explores several deep
learning-based named entity recognition (NER) methods to determine which
method(s) perform better on the de-identification task. We trained and tested
our models on the i2b2 training dataset, and qualitatively assessed their
performance using EHR data collected from a local hospital. We found that 1)
BiLSTM-CRF represents the best-performing encoder/decoder combination, 2)
character-embeddings and CRFs tend to improve precision at the price of recall,
and 3) transformers alone under-perform as context encoders. Future work
focused on structuring medical text may improve the extraction of semantic and
syntactic information for the purposes of EHR de-identification.
Related papers
- DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization [13.038800602897354]
We develop an adversarial approach that uses a large language model to re-identify the patient corresponding to a redacted clinical note.
Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
arXiv Detail & Related papers (2024-10-22T14:06:31Z)
- Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records [15.982220037507169]
The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law designed to protect sensitive health information (PHI).
One of the challenges in this area of research is the heterogeneous nature of PHI fields in data across different parties.
This variability makes rule-based sensitive variable identification systems that work on one database fail on another.
arXiv Detail & Related papers (2023-04-30T16:14:23Z)
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework ("DeID-GPT").
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
- Rediscovery of CNN's Versatility for Text-based Encoding of Raw Electronic Health Records [22.203204279166496]
We search for a versatile encoder not only reducing the large data into a manageable size but also well preserving the core information of patients to perform diverse clinical tasks.
We found that hierarchically structured Convolutional Neural Network (CNN) often outperforms the state-of-the-art model on diverse tasks.
arXiv Detail & Related papers (2023-03-15T00:37:18Z)
- 2021 BEETL Competition: Advancing Transfer Learning for Subject Independence & Heterogenous EEG Data Sets [89.84774119537087]
We design two transfer learning challenges around diagnostics and Brain-Computer Interfacing (BCI).
Task 1 is centred on medical diagnostics, addressing automatic sleep stage annotation across subjects.
Task 2 is centred on Brain-Computer Interfacing (BCI), addressing motor imagery decoding across both subjects and data sets.
arXiv Detail & Related papers (2022-02-14T12:12:20Z)
- EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders [34.22731849545798]
We propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters and encounter features.
We illustrate that EVA can produce realistic sequences, account for individual differences among patients, and can be conditioned on specific disease conditions.
We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients.
arXiv Detail & Related papers (2020-12-18T02:37:49Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- BiteNet: Bidirectional Temporal Encoder Network to Predict Medical Outcomes [53.163089893876645]
We propose a novel self-attention mechanism that captures the contextual dependency and temporal relationships within a patient's healthcare journey.
An end-to-end bidirectional temporal encoder network (BiteNet) then learns representations of the patient's journeys.
We have evaluated the effectiveness of our methods on two supervised prediction and two unsupervised clustering tasks with a real-world EHR dataset.
arXiv Detail & Related papers (2020-09-24T00:42:36Z)
- Uncovering the structure of clinical EEG signals with self-supervised learning [64.4754948595556]
Supervised learning paradigms are often limited by the amount of labeled data that is available.
This phenomenon is particularly problematic in clinically-relevant data, such as electroencephalography (EEG).
By extracting information from unlabeled data, it might be possible to reach competitive performance with deep neural networks.
arXiv Detail & Related papers (2020-07-31T14:34:47Z)
- MASK: A flexible framework to facilitate de-identification of clinical texts [2.3015324171336378]
We present MASK, a software package that is designed to perform the de-identification task.
The software is able to perform named entity recognition using some of the state-of-the-art techniques and then mask or redact recognized entities.
arXiv Detail & Related papers (2020-05-24T08:53:00Z)
- DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction [67.91606509226132]
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment.
DeepEnroll is a cross-modal inference learning model to jointly encode enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference.
arXiv Detail & Related papers (2020-01-22T17:51:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.