Pyclipse, a library for deidentification of free-text clinical notes
- URL: http://arxiv.org/abs/2311.02748v1
- Date: Sun, 5 Nov 2023 19:56:58 GMT
- Title: Pyclipse, a library for deidentification of free-text clinical notes
- Authors: Callandra Moore, Jonathan Ranisau, Walter Nelson, Jeremy Petch,
Alistair Johnson
- Abstract summary: We propose the pyclipse framework to streamline the comparison of deidentification algorithms.
Pyclipse serves as a single interface for running open-source deidentification algorithms on local clinical data.
We find that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset.
- Score: 0.40329768057075643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated deidentification of clinical text data is crucial due to the high
cost of manual deidentification, which has been a barrier to sharing clinical
text and the advancement of clinical natural language processing. However,
creating effective automated deidentification tools faces several challenges,
including issues in reproducibility due to differences in text processing,
evaluation methods, and a lack of consistency across clinical domains and
institutions. To address these challenges, we propose the pyclipse framework, a
unified and configurable evaluation procedure to streamline the comparison of
deidentification algorithms. Pyclipse serves as a single interface for running
open-source deidentification algorithms on local clinical data, allowing for
context-specific evaluation. To demonstrate the utility of pyclipse, we compare
six deidentification algorithms across four public and two private clinical
text datasets. We find that algorithm performance consistently falls short of
the results reported in the original papers, even when evaluated on the same
benchmark dataset. These discrepancies highlight the complexity of accurately
assessing and comparing deidentification algorithms, emphasizing the need for a
reproducible, adjustable, and extensible framework like pyclipse. Our framework
lays the foundation for a unified approach to evaluate and improve
deidentification tools, ultimately enhancing patient protection in clinical
natural language processing.
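One reason reported deidentification scores diverge across papers is that token-level evaluation depends on tokenization and span-matching choices. The following is a minimal illustrative sketch (not pyclipse's actual API; all names here are hypothetical) of token-level precision/recall scoring for PHI spans, showing where such choices enter:

```python
import re

def tokenize(text):
    """Whitespace/word tokenizer; real evaluations differ here, which is
    one source of the score discrepancies discussed in the abstract."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def token_labels(text, spans):
    """Label each token 1 if it overlaps any annotated PHI span, else 0."""
    return [int(any(ts < e and s < te for s, e in spans))
            for ts, te in tokenize(text)]

def precision_recall(text, gold_spans, pred_spans):
    """Token-level precision and recall of predicted PHI spans vs. gold."""
    gold = token_labels(text, gold_spans)
    pred = token_labels(text, pred_spans)
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

note = "Patient John Smith seen on 2021-03-04 at General Hospital."
gold = [(8, 18), (27, 37)]   # "John Smith", "2021-03-04"
pred = [(8, 18)]             # a system that found only the name
p, r = precision_recall(note, gold, pred)
```

Switching to a character-level or entity-level matcher over the same spans yields different numbers, which is exactly the kind of variation a unified harness is meant to control.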
Related papers
- DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization [13.038800602897354]
We develop an adversarial approach using a large language model to re-identify the patient corresponding to a redacted clinical note.
Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
arXiv Detail & Related papers (2024-10-22T14:06:31Z) - DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data [6.473402241020136]
This work enhances the MASK framework by integrating ClinicalBERT, a deep learning model specifically fine-tuned on clinical texts.
The system effectively identifies and either redacts or replaces sensitive identifiable entities within clinical documents.
A risk assessment feature has also been developed, which analyses the uniqueness of context within documents to classify them into risk levels.
arXiv Detail & Related papers (2024-10-02T15:16:02Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models [48.07083163501746]
Clinical natural language processing requires methods that can address domain-specific challenges.
We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process.
Our empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks.
arXiv Detail & Related papers (2023-11-01T04:37:28Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
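One inexpensive component of a self-verification step can be illustrated without any model call: requiring that every extracted value be groundable verbatim in the source note. This is a hedged sketch of that provenance check only (function and field names are illustrative assumptions, not the paper's method; the actual approach asks the LLM itself to verify its outputs):

```python
def verify_extractions(note, extractions):
    """Split extracted field -> value pairs into verified and flagged sets,
    based on whether each value appears verbatim in the source note.
    A real self-verification step would also prompt the LLM to cite
    evidence; this sketch performs only the cheap string-grounding check."""
    verified, flagged = {}, {}
    lowered = note.lower()
    for field, value in extractions.items():
        if value and value.lower() in lowered:
            verified[field] = value
        else:
            flagged[field] = value  # ungrounded: needs review or re-extraction
    return verified, flagged

note = "Pt reports chest pain for 3 days. Started lisinopril 10 mg daily."
extracted = {"symptom": "chest pain",
             "medication": "lisinopril",
             "dose": "20 mg"}  # value not present in the note
ok, bad = verify_extractions(note, extracted)
```

The hallucinated dose lands in the flagged set, which is the interpretability benefit: each accepted value carries a pointer back to the text that supports it.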
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Detecting automatically the layout of clinical documents to enhance the performances of downstream natural language processing [53.797797404164946]
We designed an algorithm to process clinical PDF documents and extract only clinically relevant text.
The algorithm consists of several steps: initial text extraction from the PDF, followed by classification of the extracted text into categories such as body text, left notes, and footers.
Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections.
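The block-classification step described above can be sketched with simple geometric rules over bounding boxes. The thresholds and category names below are illustrative assumptions, not the paper's actual algorithm, which the abstract does not specify:

```python
def classify_block(block, page_width, page_height):
    """Assign a coarse layout category to a text block from its bounding
    box (x0, y0, x1, y1), with y increasing downward as in PDF viewers.
    Thresholds are illustrative; a production system would tune or learn
    them per document template."""
    x0, y0, x1, y1 = block
    if y0 > 0.9 * page_height:
        return "footer"          # block starts in the bottom strip
    if y1 < 0.1 * page_height:
        return "header"          # block ends in the top strip
    if x1 < 0.25 * page_width:
        return "left_note"       # narrow column hugging the left margin
    return "body_text"

page_w, page_h = 595, 842        # A4 page size in PDF points
blocks = [(60, 60, 530, 760),    # main text column
          (20, 100, 120, 300),   # narrow left-margin column
          (60, 800, 530, 830)]   # strip at the bottom of the page
labels = [classify_block(b, page_w, page_h) for b in blocks]
```

Downstream NLP then runs only on blocks labeled as body text, which is what the medical-concept extraction evaluation above measures.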
arXiv Detail & Related papers (2023-05-23T08:38:33Z) - Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We built a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - Interactive Medical Image Segmentation with Self-Adaptive Confidence Calibration [10.297081695050457]
This paper proposes an interactive segmentation framework, interactive MEdical segmentation with self-adaptive Confidence CAlibration (MECCA).
Confidence is estimated through a novel action-based confidence network, and corrective actions are obtained via multi-agent reinforcement learning (MARL).
Experimental results on various medical image datasets demonstrate the strong performance of the proposed algorithm.
arXiv Detail & Related papers (2021-11-15T12:38:56Z) - Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.