Pyclipse, a library for deidentification of free-text clinical notes
- URL: http://arxiv.org/abs/2311.02748v1
- Date: Sun, 5 Nov 2023 19:56:58 GMT
- Title: Pyclipse, a library for deidentification of free-text clinical notes
- Authors: Callandra Moore, Jonathan Ranisau, Walter Nelson, Jeremy Petch,
Alistair Johnson
- Abstract summary: We propose the pyclipse framework to streamline the comparison of deidentification algorithms.
Pyclipse serves as a single interface for running open-source deidentification algorithms on local clinical data.
We find that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset.
- Score: 0.40329768057075643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated deidentification of clinical text data is crucial due to the high
cost of manual deidentification, which has been a barrier to sharing clinical
text and the advancement of clinical natural language processing. However,
creating effective automated deidentification tools faces several challenges,
including issues in reproducibility due to differences in text processing,
evaluation methods, and a lack of consistency across clinical domains and
institutions. To address these challenges, we propose the pyclipse framework, a
unified and configurable evaluation procedure to streamline the comparison of
deidentification algorithms. Pyclipse serves as a single interface for running
open-source deidentification algorithms on local clinical data, allowing for
context-specific evaluation. To demonstrate the utility of pyclipse, we compare
six deidentification algorithms across four public and two private clinical
text datasets. We find that algorithm performance consistently falls short of
the results reported in the original papers, even when evaluated on the same
benchmark dataset. These discrepancies highlight the complexity of accurately
assessing and comparing deidentification algorithms, emphasizing the need for a
reproducible, adjustable, and extensible framework like pyclipse. Our framework
lays the foundation for a unified approach to evaluate and improve
deidentification tools, ultimately enhancing patient protection in clinical
natural language processing.
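One reason reported deidentification scores diverge across papers is that token-level evaluation depends on tokenization and span-matching choices. The following is a minimal illustrative sketch (not pyclipse's actual API; all names here are hypothetical) of token-level precision/recall scoring for PHI spans, showing where such choices enter:

```python
import re

def tokenize(text):
    """Whitespace/word tokenizer; real evaluations differ here, which is
    one source of the score discrepancies discussed in the abstract."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def token_labels(text, spans):
    """Label each token 1 if it overlaps any annotated PHI span, else 0."""
    return [int(any(ts < e and s < te for s, e in spans))
            for ts, te in tokenize(text)]

def precision_recall(text, gold_spans, pred_spans):
    """Token-level precision and recall of predicted PHI spans vs. gold."""
    gold = token_labels(text, gold_spans)
    pred = token_labels(text, pred_spans)
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

note = "Patient John Smith seen on 2021-03-04 at General Hospital."
gold = [(8, 18), (27, 37)]   # "John Smith", "2021-03-04"
pred = [(8, 18)]             # a system that found only the name
p, r = precision_recall(note, gold, pred)
```

Switching to a character-level or entity-level matcher over the same spans yields different numbers, which is exactly the kind of variation a unified harness is meant to control.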
Related papers
- DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization [13.038800602897354]
We develop an adversarial approach using a large language model to re-identify the patient corresponding to a redacted clinical note.
Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
arXiv Detail & Related papers (2024-10-22T14:06:31Z) - DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data [6.473402241020136]
This work enhances the MASK framework by integrating ClinicalBERT, a deep learning model specifically fine-tuned on clinical texts.
The system effectively identifies and either redacts or replaces sensitive identifiable entities within clinical documents.
A risk assessment feature has also been developed, which analyses the uniqueness of context within documents to classify them into risk levels.
arXiv Detail & Related papers (2024-10-02T15:16:02Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models [48.07083163501746]
Clinical natural language processing requires methods that can address domain-specific challenges.
We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process.
Our empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks.
arXiv Detail & Related papers (2023-11-01T04:37:28Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
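One inexpensive component of a self-verification step can be illustrated without any model call: requiring that every extracted value be groundable verbatim in the source note. This is a hedged sketch of that provenance check only (function and field names are illustrative assumptions, not the paper's method; the actual approach asks the LLM itself to verify its outputs):

```python
def verify_extractions(note, extractions):
    """Split extracted field -> value pairs into verified and flagged sets,
    based on whether each value appears verbatim in the source note.
    A real self-verification step would also prompt the LLM to cite
    evidence; this sketch performs only the cheap string-grounding check."""
    verified, flagged = {}, {}
    lowered = note.lower()
    for field, value in extractions.items():
        if value and value.lower() in lowered:
            verified[field] = value
        else:
            flagged[field] = value  # ungrounded: needs review or re-extraction
    return verified, flagged

note = "Pt reports chest pain for 3 days. Started lisinopril 10 mg daily."
extracted = {"symptom": "chest pain",
             "medication": "lisinopril",
             "dose": "20 mg"}  # value not present in the note
ok, bad = verify_extractions(note, extracted)
```

The hallucinated dose lands in the flagged set, which is the interpretability benefit: each accepted value carries a pointer back to the text that supports it.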
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Detecting automatically the layout of clinical documents to enhance the performances of downstream natural language processing [53.797797404164946]
We designed an algorithm to process clinical PDF documents and extract only clinically relevant text.
The algorithm consists of several steps: initial text extraction from the PDF, followed by classification of the extracted text into categories such as body text, left notes, and footers.
Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections.
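The block-classification step described above can be sketched with simple geometric rules over bounding boxes. The thresholds and category names below are illustrative assumptions, not the paper's actual algorithm, which the abstract does not specify:

```python
def classify_block(block, page_width, page_height):
    """Assign a coarse layout category to a text block from its bounding
    box (x0, y0, x1, y1), with y increasing downward as in PDF viewers.
    Thresholds are illustrative; a production system would tune or learn
    them per document template."""
    x0, y0, x1, y1 = block
    if y0 > 0.9 * page_height:
        return "footer"          # block starts in the bottom strip
    if y1 < 0.1 * page_height:
        return "header"          # block ends in the top strip
    if x1 < 0.25 * page_width:
        return "left_note"       # narrow column hugging the left margin
    return "body_text"

page_w, page_h = 595, 842        # A4 page size in PDF points
blocks = [(60, 60, 530, 760),    # main text column
          (20, 100, 120, 300),   # narrow left-margin column
          (60, 800, 530, 830)]   # strip at the bottom of the page
labels = [classify_block(b, page_w, page_h) for b in blocks]
```

Downstream NLP then runs only on blocks labeled as body text, which is what the medical-concept extraction evaluation above measures.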
arXiv Detail & Related papers (2023-05-23T08:38:33Z) - Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We built a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - Interactive Medical Image Segmentation with Self-Adaptive Confidence Calibration [10.297081695050457]
This paper proposes an interactive segmentation framework, interactive MEdical segmentation with self-adaptive Confidence CAlibration (MECCA).
Confidence is estimated through a novel action-based confidence network, and corrective actions are obtained via multi-agent reinforcement learning (MARL).
Experimental results on various medical image datasets demonstrate the strong performance of the proposed algorithm.
arXiv Detail & Related papers (2021-11-15T12:38:56Z) - Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.