De-Identification of French Unstructured Clinical Notes for Machine
Learning Tasks
- URL: http://arxiv.org/abs/2209.09631v2
- Date: Fri, 6 Oct 2023 14:40:52 GMT
- Authors: Yakini Tchouka, Jean-François Couchot, Maxime Coulmeau, David
Laiymani, Philippe Selles, Azzedine Rahmani
- Abstract summary: We propose a new comprehensive de-identification method dedicated to French-language medical documents.
The approach has been evaluated on a French-language medical dataset from a French public hospital.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unstructured textual data are at the heart of health systems: liaison letters
between doctors, operating reports, coding of procedures according to the
ICD-10 standard, etc. The details included in these documents make it possible
to get to know the patient better, to better manage him or her, to better study
the pathologies, to accurately remunerate the associated medical acts, and so on. All
this seems to be (at least partially) within reach today thanks to artificial
intelligence techniques. However, for obvious reasons of privacy protection,
the designers of these AIs do not have the legal right to access these
documents as long as they contain identifying data. De-identifying these
documents, i.e. detecting and deleting all identifying information present in
them, is a legally necessary step for sharing this data between two
complementary worlds. Over the last decade, several proposals have been made to
de-identify documents, mainly in English. While the detection scores are often
high, the substitution methods are often not very robust to attack. In French,
very few methods are based on arbitrary detection and/or substitution rules. In
this paper, we propose a new comprehensive de-identification method dedicated
to French-language medical documents. Both the approach for the detection of
identifying elements (based on deep learning) and their substitution (based on
differential privacy) are based on the most proven existing approaches. The
result is an approach that effectively protects the privacy of the patients at
the heart of these medical documents. The whole approach has been evaluated on
a French-language medical dataset from a French public hospital, and the results
are very encouraging.
Related papers
- Validating transformers for redaction of text from electronic health
records in real-world healthcare [1.561423634851244]
We present AnonCAT, a transformer-based model and a blueprint on how deidentification models can be deployed in real-world healthcare.
AnonCAT was trained through a process involving manually annotated redactions of real-world documents from three UK hospitals.
Our findings demonstrate the potential of deep learning techniques for improving the efficiency and accuracy of redaction in global healthcare data.
arXiv Detail & Related papers (2023-10-05T19:10:18Z)
- Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework ("DeID-GPT").
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
- An Easy-to-use and Robust Approach for the Differentially Private
De-Identification of Clinical Textual Documents [0.0]
This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening the less robust de-identification.
The result is an approach for de-identifying clinical documents in French language, but also generalizable to other languages.
arXiv Detail & Related papers (2022-11-02T14:25:09Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and a version of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- Automated Drug-Related Information Extraction from French Clinical
Documents: ReLyfe Approach [0.4588028371034407]
This paper proposes a new approach for extracting drug-related information from French clinical scanned documents.
It is a combination of a rule-based phase and a Deep Learning approach.
arXiv Detail & Related papers (2021-11-29T22:11:23Z)
- An Analysis of a BERT Deep Learning Strategy on a Technology Assisted
Review Task [91.3755431537592]
Document screening is a central task within Evidence-Based Medicine.
I propose a DL document classification approach with BERT or PubMedBERT embeddings and a DL similarity search path.
I test and evaluate the retrieval effectiveness of my DL strategy on the 2017 and 2018 CLEF eHealth collections.
arXiv Detail & Related papers (2021-04-16T19:45:27Z)
- Comparing Rule-based, Feature-based and Deep Neural Methods for
De-identification of Dutch Medical Records [4.339510167603376]
We construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare.
We test the generalizability of three de-identification methods across languages and domains.
arXiv Detail & Related papers (2020-01-16T09:42:29Z)