An Easy-to-use and Robust Approach for the Differentially Private
De-Identification of Clinical Textual Documents
- URL: http://arxiv.org/abs/2211.01147v1
- Date: Wed, 2 Nov 2022 14:25:09 GMT
- Title: An Easy-to-use and Robust Approach for the Differentially Private
De-Identification of Clinical Textual Documents
- Authors: Yakini Tchouka, Jean-François Couchot and David Laiymani
- Abstract summary: This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening the less robust de-identification method and by adapting state-of-the-art differentially private mechanisms for substitution purposes.
The result is an approach for de-identifying clinical documents in French that also generalizes to other languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unstructured textual data is at the heart of healthcare systems. For obvious
privacy reasons, these documents are not accessible to researchers as long as
they contain personally identifiable information. One way to share this data
while respecting the legislative framework (notably GDPR or HIPAA) is to
de-identify it within the medical structures, i.e. to detect a person's
personal information with a Named Entity Recognition (NER) system and then
replace it so that associating the document with the person becomes very
difficult. The challenge is to have reliable NER and substitution tools without
compromising confidentiality and consistency in the document. Most existing
research focuses on English medical documents and uses coarse substitutions
that do not benefit from advances in privacy. This paper shows how
an efficient and differentially private de-identification approach can be
achieved by strengthening the less robust de-identification method and by
adapting state-of-the-art differentially private mechanisms for substitution
purposes. The result is an approach for de-identifying clinical documents in
French that also generalizes to other languages and whose robustness
is mathematically proven.
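The detect-then-substitute idea described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: a regular expression stands in for the NER system, only ages written as "NN ans" are handled, and the substitution uses the standard Laplace mechanism on a value clamped to a hypothetical range of 0 to 120. The abstract does not say which differentially private mechanisms the authors adapt.

```python
import random
import re

# Hypothetical stand-in for an NER system: matches ages such as "67 ans".
AGE_PATTERN = re.compile(r"\b(\d{1,3}) ans\b")


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize_ages(text: str, epsilon: float, rng: random.Random,
                   lo: int = 0, hi: int = 120) -> str:
    """Replace each detected age with a Laplace-perturbed value.

    The age is clamped to [lo, hi], so its sensitivity is (hi - lo) and the
    noise scale is (hi - lo) / epsilon. Rounding and clamping the noisy value
    are post-processing steps and do not weaken the guarantee.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = (hi - lo) / epsilon

    def substitute(match: re.Match) -> str:
        age = min(max(int(match.group(1)), lo), hi)
        noisy = age + laplace_noise(scale, rng)
        noisy = int(round(min(max(noisy, lo), hi)))
        return f"{noisy} ans"

    return AGE_PATTERN.sub(substitute, text)


if __name__ == "__main__":
    note = "Patient âgé de 67 ans, admis pour douleur thoracique."
    print(privatize_ages(note, epsilon=1.0, rng=random.Random(0)))
```

In this sketch each detected age spends its own epsilon, so a document with several ages consumes a correspondingly larger budget. With a small epsilon the noise is large relative to the age range, which shows the usual trade-off between privacy and utility.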
Related papers
- Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z)
- Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT-4-enabled de-identification framework ("DeID-GPT").
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
- Unsupervised Text Deidentification [101.2219634341714]
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank.
arXiv Detail & Related papers (2022-10-20T18:54:39Z)
- De-Identification of French Unstructured Clinical Notes for Machine
Learning Tasks [0.0]
We propose a new comprehensive de-identification method dedicated to French-language medical documents.
The approach has been evaluated on a French language medical dataset of a French public hospital.
arXiv Detail & Related papers (2022-09-16T13:00:47Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identifying code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method, based on a language model trained on medical forum data, generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- Performance of Automatic De-identification Across Different Note Types [0.8399688944263842]
Concerns about patient privacy and confidentiality limit the use of clinical notes for research.
We present the performance of a state-of-the-art de-id system called NeuroNER on a diverse set of notes from the University of Washington.
arXiv Detail & Related papers (2021-02-17T00:55:40Z)
- MASK: A flexible framework to facilitate de-identification of clinical
texts [2.3015324171336378]
We present MASK, a software package that is designed to perform the de-identification task.
The software is able to perform named entity recognition using some of the state-of-the-art techniques and then mask or redact recognized entities.
arXiv Detail & Related papers (2020-05-24T08:53:00Z)
- Comparing Rule-based, Feature-based and Deep Neural Methods for
De-identification of Dutch Medical Records [4.339510167603376]
We construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare.
We test the generalizability of three de-identification methods across languages and domains.
arXiv Detail & Related papers (2020-01-16T09:42:29Z)