Related papers: Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record De-identification

Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record De-identification

URL: http://arxiv.org/abs/2103.13546v1
Date: Thu, 25 Mar 2021 01:26:58 GMT
Title: Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record De-identification
Authors: Abdullah Ahmed, Adeel Abbasi, Carsten Eickhoff
Abstract summary: Federal law restricts the sharing of any EHR data that contains protected health information (PHI) This project explores several deep learning-based named entity recognition (NER) methods to determine which method(s) perform better on the de-identification task. We trained and tested our models on the i2b2 training dataset, and qualitatively assessed their performance using EHR data collected from a local hospital.
Score: 6.026640792312181
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Electronic Health Records (EHRs) have become the primary form of medical data-keeping across the United States. Federal law restricts the sharing of any EHR data that contains protected health information (PHI). De-identification, the process of identifying and removing all PHI, is crucial for making EHR data publicly available for scientific research. This project explores several deep learning-based named entity recognition (NER) methods to determine which method(s) perform better on the de-identification task. We trained and tested our models on the i2b2 training dataset, and qualitatively assessed their performance using EHR data collected from a local hospital. We found that 1) BiLSTM-CRF represents the best-performing encoder/decoder combination, 2) character-embeddings and CRFs tend to improve precision at the price of recall, and 3) transformers alone under-perform as context encoders. Future work focused on structuring medical text may improve the extraction of semantic and syntactic information for the purposes of EHR de-identification.

Related papers

Extracting OPQRST in Electronic Health Records using Large Language Models with Reasoning [3.486461799078777]
This paper introduces a novel approach to extracting the OPQRST assessment from EHRs by leveraging the capabilities of Large Language Models (LLMs)<n>We propose to reframe the task from sequence labeling to text generation, enabling the models to provide reasoning steps that mimic a physician's cognitive processes.<n>Our contributions demonstrate a significant advancement in the use of AI in healthcare, offering a scalable solution that improves the accuracy and usability of information extraction from EHRs.
arXiv Detail & Related papers (2025-09-02T02:21:02Z)
DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data [2.9929405444223205]
EHRs are pivotal in clinical practices, yet their retrieval remains a challenge mainly due to semantic gap issues.<n>Recent advancements in dense retrieval offer promising solutions but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora.<n>This paper introduces textttDR.EHR, a series of dense retrieval models specifically tailored for EHR retrieval.
arXiv Detail & Related papers (2025-07-24T17:02:46Z)
DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization [13.038800602897354]
We develop an adversarial approach using a large language model to re-identify the patient corresponding to a redacted clinical note. Our method uses a large language model to reidentify the patient corresponding to a redacted clinical note. Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
arXiv Detail & Related papers (2024-10-22T14:06:31Z)
Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records [15.982220037507169]
The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law designed to protect sensitive health information (PHI) One of the challenges in this area of research is the heterogeneous nature of PHI fields in data across different parties. This variability makes rule-based sensitive variable identification systems that work on one database fail on another.
arXiv Detail & Related papers (2023-04-30T16:14:23Z)
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework (DeID-GPT") Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text. This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
Rediscovery of CNN's Versatility for Text-based Encoding of Raw Electronic Health Records [22.203204279166496]
We search for a versatile encoder not only reducing the large data into a manageable size but also well preserving the core information of patients to perform diverse clinical tasks. We found that hierarchically structured Convolutional Neural Network (CNN) often outperforms the state-of-the-art model on diverse tasks.
arXiv Detail & Related papers (2023-03-15T00:37:18Z)
2021 BEETL Competition: Advancing Transfer Learning for Subject Independence & Heterogenous EEG Data Sets [89.84774119537087]
We design two transfer learning challenges around diagnostics and Brain-Computer-Interfacing (BCI) Task 1 is centred on medical diagnostics, addressing automatic sleep stage annotation across subjects. Task 2 is centred on Brain-Computer Interfacing (BCI), addressing motor imagery decoding across both subjects and data sets.
arXiv Detail & Related papers (2022-02-14T12:12:20Z)
EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders [34.22731849545798]
We propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters and encounter features. We illustrate that EVA can produce realistic sequences, account for individual differences among patients, and can be conditioned on specific disease conditions. We assess the utility of the methods on large real-world EHR repositories containing over 250, 000 patients.
arXiv Detail & Related papers (2020-12-18T02:37:49Z)
DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences. Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
BiteNet: Bidirectional Temporal Encoder Network to Predict Medical Outcomes [53.163089893876645]
We propose a novel self-attention mechanism that captures the contextual dependency and temporal relationships within a patient's healthcare journey. An end-to-end bidirectional temporal encoder network (BiteNet) then learns representations of the patient's journeys. We have evaluated the effectiveness of our methods on two supervised prediction and two unsupervised clustering tasks with a real-world EHR dataset.
arXiv Detail & Related papers (2020-09-24T00:42:36Z)
Uncovering the structure of clinical EEG signals with self-supervised learning [64.4754948595556]
Supervised learning paradigms are often limited by the amount of labeled data that is available. This phenomenon is particularly problematic in clinically-relevant data, such as electroencephalography (EEG) By extracting information from unlabeled data, it might be possible to reach competitive performance with deep neural networks.
arXiv Detail & Related papers (2020-07-31T14:34:47Z)
MASK: A flexible framework to facilitate de-identification of clinical texts [2.3015324171336378]
We present MASK, a software package that is designed to perform the de-identification task. The software is able to perform named entity recognition using some of the state-of-the-art techniques and then mask or redact recognized entities.
arXiv Detail & Related papers (2020-05-24T08:53:00Z)
DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction [67.91606509226132]
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. DeepEnroll is a cross-modal inference learning model to jointly encode enrollment criteria (tabular data) into a shared latent space for matching inference.
arXiv Detail & Related papers (2020-01-22T17:51:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.