Comparing Rule-based, Feature-based and Deep Neural Methods for
De-identification of Dutch Medical Records
- URL: http://arxiv.org/abs/2001.05714v1
- Date: Thu, 16 Jan 2020 09:42:29 GMT
- Title: Comparing Rule-based, Feature-based and Deep Neural Methods for
De-identification of Dutch Medical Records
- Authors: Jan Trienes, Dolf Trieschnigg, Christin Seifert, Djoerd Hiemstra
- Abstract summary: We construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare.
We test the generalizability of three de-identification methods across languages and domains.
- Score: 4.339510167603376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unstructured information in electronic health records provides an invaluable
resource for medical research. To protect the confidentiality of patients and
to conform to privacy regulations, de-identification methods automatically
remove personally identifying information from these medical records. However,
due to the unavailability of labeled data, most existing research is
constrained to English medical text and little is known about the
generalizability of de-identification methods across languages and domains. In
this study, we construct a varied dataset consisting of the medical records of
1260 patients by sampling data from 9 institutes and three domains of Dutch
healthcare. We test the generalizability of three de-identification methods
across languages and domains. Our experiments show that an existing rule-based
method specifically developed for the Dutch language fails to generalize to
this new data. Furthermore, a state-of-the-art neural architecture performs
strongly across languages and domains, even with limited training data.
Compared to feature-based and rule-based methods, the neural method requires
significantly less configuration effort and domain knowledge. We make all code
and pre-trained de-identification models available to the research community,
allowing practitioners to apply them to their datasets and enabling future
benchmarks.
Related papers
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z)
- Advancing Italian Biomedical Information Extraction with Transformers-based Models: Methodological Insights and Multicenter Practical Application [0.27027468002793437]
Information Extraction can help clinical practitioners overcome the limitation by using automated text-mining pipelines.
We created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Transformers-based model.
The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) a fine-tuning strategy that combines classical methods with a "low-resource" approach.
arXiv Detail & Related papers (2023-06-08T16:15:46Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework ("DeID-GPT").
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
- De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks [0.0]
We propose a new comprehensive de-identification method dedicated to French-language medical documents.
The approach has been evaluated on a French-language medical dataset from a French public hospital.
arXiv Detail & Related papers (2022-09-16T13:00:47Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.