Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
- URL: http://arxiv.org/abs/2511.04079v1
- Date: Thu, 06 Nov 2025 05:37:26 GMT
- Title: Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
- Authors: Eva Prakash, Maayane Attias, Pierre Chambon, Justin Xu, Steven Truong, Jean-Benoit Delbrouck, Tessa Cook, Curtis Langlotz
- Abstract summary: We build upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania for token-level PHI detection.
- Score: 4.980073263011964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
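The abstract reports precision, recall, and F1 for token-level PHI detection across all PHI categories. As a rough illustration of how such scores can be computed, here is a minimal sketch. It is not the authors' evaluation code: the function name `token_prf`, the outside label "O", the micro-averaged "overall" row, and the category labels NAME and DATE are assumptions made for this example (only AGE is named in the abstract).

```python
from collections import Counter


def token_prf(gold, pred, outside="O"):
    """Token-level precision, recall and F1 per PHI category, plus a
    micro-averaged "overall" entry. `gold` and `pred` are equal-length
    lists holding one label per token; `outside` marks non-PHI tokens.
    (Hypothetical helper, not taken from the paper.)"""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must label the same tokens")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp[p] += 1          # correct PHI label
        else:
            if p != outside:
                fp[p] += 1      # predicted PHI that is wrong or spurious
            if g != outside:
                fn[g] += 1      # gold PHI that was missed or mislabeled

    def prf(t, f_pos, f_neg):
        precision = t / (t + f_pos) if t + f_pos else 0.0
        recall = t / (t + f_neg) if t + f_neg else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    categories = sorted(set(tp) | set(fp) | set(fn))
    scores = {c: prf(tp[c], fp[c], fn[c]) for c in categories}
    scores["overall"] = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return scores


# Toy example with made-up labels (NAME and DATE are assumed categories).
gold = ["O", "NAME", "NAME", "O", "DATE", "AGE", "O"]
pred = ["O", "NAME", "O", "O", "DATE", "DATE", "O"]
for category, (p, r, f1) in token_prf(gold, pred).items():
    print(f"{category}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this sketch a token whose predicted category differs from the gold category counts as both a false positive for the predicted category and a false negative for the gold one. A given evaluation protocol may handle that case differently.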
Related papers
- Attention-Based Deep Learning for Early Parkinson's Disease Detection with Tabular Biomedical Data [0.0]
Early and accurate detection of Parkinson's disease (PD) remains a critical challenge in medical diagnostics.
Traditional machine learning (ML) models, though widely applied to PD detection, often rely on extensive feature engineering and struggle to capture complex feature interactions.
We present a comparative evaluation of four classification models: Multi-Layer Perceptron (MLP), Gradient Boosting, TabNet, and SAINT.
arXiv Detail & Related papers (2026-02-08T12:03:02Z)
- Validating Vision Transformers for Otoscopy: Performance and Data-Leakage Effects [42.465094107111646]
This study evaluates the efficacy of vision transformer models, specifically Swin transformers, in enhancing the diagnostic accuracy of ear diseases.
The research utilised a real-world dataset from the Department of Otolaryngology at the Clinical Hospital of the Universidad de Chile.
arXiv Detail & Related papers (2025-11-06T23:20:37Z)
- Skin Cancer Classification: Hybrid CNN-Transformer Models with KAN-Based Fusion [0.0]
We explore Sequential and Parallel Hybrid CNN-Transformer models with Convolutional Kolmogorov-Arnold Network (CKAN).
Our approach integrates transfer learning and extensive data augmentation, where CNNs extract local spatial features, Transformers model global dependencies, and CKAN facilitates nonlinear feature fusion for improved representation learning.
Our proposed approach achieves competitive performance in skin cancer classification, demonstrating 92.81% accuracy and 92.47% F1-score on the HAM10000 dataset, 97.83% accuracy and 97.83% F1-score on the PAD-UFES dataset, and 91.17% accuracy with 91.79% F1-score on
arXiv Detail & Related papers (2025-08-17T19:57:34Z)
- PySeizure: A single machine learning classifier framework to detect seizures in diverse datasets [0.0]
We introduce an innovative, open-source machine-learning framework that enables robust seizure detection across varied clinical datasets.
To enhance robustness, the framework incorporates an automated pre-processing pipeline to standardise data and a majority voting mechanism.
We train, tune, and evaluate models within each dataset, assessing their cross-dataset transferability.
arXiv Detail & Related papers (2025-08-10T09:12:29Z)
- BioSerenity-E1: a self-supervised EEG model for medical applications [0.0]
BioSerenity-E1 is a family of self-supervised foundation models for clinical EEG applications.
It combines spectral tokenization with masked prediction to achieve state-of-the-art performance across relevant diagnostic tasks.
arXiv Detail & Related papers (2025-03-13T13:42:46Z)
- Simulated patient systems are intelligent when powered by large language model-based AI agents [32.73072809937573]
We developed AIPatient, an intelligent simulated patient system powered by large language model-based AI agents.
The system incorporates the Retrieval Augmented Generation framework, powered by six task-specific LLM-based AI agents for complex reasoning.
For simulation reality, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data.
arXiv Detail & Related papers (2024-09-27T17:17:15Z)
- Phikon-v2, A large and public feature extractor for biomarker prediction [42.52549987351643]
We train a vision transformer using DINOv2 and publicly release one iteration of this model for further experimentation, coined Phikon-v2.
While trained on publicly available histology slides, Phikon-v2 surpasses our previously released model (Phikon) and performs on par with other histopathology foundation models (FM) trained on proprietary data.
arXiv Detail & Related papers (2024-09-13T20:12:29Z)
- Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options.
The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z)
- A Federated Learning Framework for Stenosis Detection [70.27581181445329]
This study explores the use of Federated Learning (FL) for stenosis detection in coronary angiography images (CA).
Two heterogeneous datasets from two institutions were considered: dataset 1 includes 1219 images from 200 patients, which we acquired at the Ospedale Riuniti of Ancona (Italy); dataset 2 includes 7492 sequential images from 90 patients from a previous study available in the literature.
arXiv Detail & Related papers (2023-10-30T11:13:40Z)
- The Utility of the Virtual Imaging Trials Methodology for Objective Characterization of AI Systems and Training Data [1.6040478776985583]
The study was conducted for the case example of COVID-19 diagnosis using clinical and virtual computed tomography (CT) and chest radiography (CXR) processed with convolutional neural networks.
Multiple AI models were developed and tested using 3D ResNet-like and 2D EfficientNetv2 architectures across diverse datasets.
The VIT approach can be used to enhance model transparency and reliability, offering nuanced insights into the factors driving AI performance and bridging the gap between experimental and clinical settings.
arXiv Detail & Related papers (2023-08-17T19:12:32Z)
- Clinical Concept and Relation Extraction Using Prompt-based Machine Reading Comprehension [38.79665143111312]
We formulate both clinical concept extraction and relation extraction using a unified prompt-based machine reading comprehension architecture.
We compare our MRC models with existing deep learning models for concept extraction and end-to-end relation extraction.
We evaluate the transfer learning ability of the proposed MRC models in a cross-institution setting.
arXiv Detail & Related papers (2023-03-14T22:37:31Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)