Validating transformers for redaction of text from electronic health
records in real-world healthcare
- URL: http://arxiv.org/abs/2310.04468v1
- Date: Thu, 5 Oct 2023 19:10:18 GMT
- Title: Validating transformers for redaction of text from electronic health
records in real-world healthcare
- Authors: Zeljko Kraljevic, Anthony Shek, Joshua Au Yeung, Ewart Jonathan
Sheldon, Mohammad Al-Agil, Haris Shuaib, Xi Bai, Kawsar Noor, Anoop D. Shah,
Richard Dobson, James Teo
- Abstract summary: We present AnonCAT, a transformer-based model and a blueprint on how deidentification models can be deployed in real-world healthcare.
AnonCAT was trained through a process involving manually annotated redactions of real-world documents from three UK hospitals.
Our findings demonstrate the potential of deep learning techniques for improving the efficiency and accuracy of redaction in global healthcare data.
- Score: 1.561423634851244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protecting patient privacy in healthcare records is a top priority, and
redaction is a commonly used method for obscuring directly identifiable
information in text. Rule-based methods have been widely used, but their
precision is often low, causing over-redaction of text, and they are frequently
not adaptable enough for non-standardised or unconventional structures of
personal health information. Deep learning techniques have emerged as a promising
solution, but implementing them in real-world environments poses challenges due
to the differences in patient record structure and language across different
departments, hospitals, and countries.
In this study, we present AnonCAT, a transformer-based model and a blueprint
on how deidentification models can be deployed in real-world healthcare.
AnonCAT was trained through a process involving manually annotated redactions
of 3116 real-world documents from three UK hospitals with different electronic
health record systems. The model achieved high performance in all three
hospitals, with a Recall of 0.99, 0.99 and 0.96.
Our findings demonstrate the potential of deep learning techniques for
improving the efficiency and accuracy of redaction in global healthcare data
and highlight the importance of building workflows that not only use these
models but can also continually fine-tune them and audit their performance
to ensure continuing effectiveness in real-world settings.
This approach provides a blueprint for the real-world use of de-identifying
algorithms through fine-tuning and localisation. The code, together with
tutorials, is available on GitHub (https://github.com/CogStack/MedCAT).
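As a rough illustration of the Recall figures quoted in the abstract, the sketch below scores predicted redaction spans against manually annotated ones at character level. It is not AnonCAT's evaluation code: the span representation, the function names and the choice of character-level matching are all assumptions made for this example, and the paper may compute recall per token or per entity instead.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def covered_chars(spans: List[Span]) -> Set[int]:
    """Return every character offset covered by at least one span."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def redaction_recall(gold: List[Span], predicted: List[Span]) -> float:
    """Fraction of annotated identifying characters that were redacted."""
    gold_chars = covered_chars(gold)
    if not gold_chars:
        return 1.0  # nothing to redact, so nothing was missed
    return len(gold_chars & covered_chars(predicted)) / len(gold_chars)


def redaction_precision(gold: List[Span], predicted: List[Span]) -> float:
    """Fraction of redacted characters that were truly identifying."""
    pred_chars = covered_chars(predicted)
    if not pred_chars:
        return 1.0  # nothing redacted, so nothing was over-redacted
    return len(pred_chars & covered_chars(gold)) / len(pred_chars)


# Made-up example: two annotated spans of 10 characters each; the model
# redacts the first fully and only half of the second.
gold = [(3, 13), (35, 45)]
predicted = [(3, 13), (35, 40)]

print(redaction_recall(gold, predicted))     # 15 of 20 characters caught
print(redaction_precision(gold, predicted))  # no over-redaction here
```

In a redaction setting, low recall means identifying text leaks through, while low precision corresponds to the over-redaction the abstract attributes to rule-based methods.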
Related papers
- Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework ("DeID-GPT").
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
- De-Identification of French Unstructured Clinical Notes for Machine
Learning Tasks [0.0]
We propose a new comprehensive de-identification method dedicated to French-language medical documents.
The approach has been evaluated on a French language medical dataset of a French public hospital.
arXiv Detail & Related papers (2022-09-16T13:00:47Z)
- Classifying Unstructured Clinical Notes via Automatic Weak Supervision [17.45660355026785]
We introduce a general weakly-supervised text classification framework that learns from class-label descriptions only.
We leverage the linguistic domain knowledge stored within pre-trained language models and the data programming framework to assign code labels to texts.
arXiv Detail & Related papers (2022-06-24T05:55:49Z)
- BERT WEAVER: Using WEight AVERaging to enable lifelong learning for
transformer-based models in biomedical semantic search engines [49.75878234192369]
We present WEAVER, a simple, yet efficient post-processing method that infuses old knowledge into the new model.
We show that applying WEAVER in a sequential manner results in similar word embedding distributions as doing a combined training on all data at once.
arXiv Detail & Related papers (2022-02-21T10:34:41Z)
- Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and a version of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto codes used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
- Learning Contextualized Document Representations for Healthcare Answer
Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
- Comparing Rule-based, Feature-based and Deep Neural Methods for
De-identification of Dutch Medical Records [4.339510167603376]
We construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare.
We test the generalizability of three de-identification methods across languages and domains.
arXiv Detail & Related papers (2020-01-16T09:42:29Z)