Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation
- URL: http://arxiv.org/abs/2508.01889v1
- Date: Sun, 03 Aug 2025 18:48:28 GMT
- Title: Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation
- Authors: Michael W. Rutherford, Tracy Nolan, Linmin Pei, Ulrike Wagner, Qinyan Pan, Phillip Farmer, Kirk Smith, Benjamin Kopchick, Laura Opsahl-Ong, Granger Sutton, David Clunie, Keyvan Farahani, Fred Prior
- Abstract summary: Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM) encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification.
- Score: 0.10617782943195009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule "Safe Harbor" method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
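The answer-key comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the MIDI evaluation script: the `ExpectedAction` record, the action names (`remove`, `replace`, `keep`), the `check_against_answer_key` function, and the sample values are all hypothetical, and the curated instance is modeled as a plain dictionary of attribute keywords rather than a parsed DICOM file.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ExpectedAction:
    """One hypothetical answer-key entry for a single DICOM attribute."""
    tag: str                      # attribute keyword, e.g. "PatientName"
    action: str                   # "remove", "replace", or "keep" (assumed vocabulary)
    value: Optional[str] = None   # expected value for "replace" and "keep"


def check_against_answer_key(
    curated: Dict[str, str], answer_key: List[ExpectedAction]
) -> List[Tuple[str, str, Optional[str], Optional[str]]]:
    """Return (tag, action, expected, actual) for every entry that fails."""
    failures = []
    for expected in answer_key:
        actual = curated.get(expected.tag)
        if expected.action == "remove":
            passed = actual is None               # attribute must be absent
        elif expected.action in ("replace", "keep"):
            passed = actual == expected.value     # attribute must match exactly
        else:
            raise ValueError(f"unknown action: {expected.action}")
        if not passed:
            failures.append((expected.tag, expected.action, expected.value, actual))
    return failures


# Hypothetical answer key and de-identified output for one instance.
answer_key = [
    ExpectedAction("PatientName", "replace", "MIDI-0001"),
    ExpectedAction("PatientBirthDate", "remove"),
    ExpectedAction("Modality", "keep", "CT"),
]
curated = {
    "PatientName": "MIDI-0001",
    "PatientBirthDate": "19600101",   # leaked: should have been removed
    "Modality": "CT",
}

failures = check_against_answer_key(curated, answer_key)
pass_rate = 1 - len(failures) / len(answer_key)
for tag, action, expected_value, actual in failures:
    print(f"FAIL {tag}: expected {action} ({expected_value!r}), found {actual!r}")
```

A per-attribute pass/fail record of this kind is what makes the comparison objective: each failure points to a specific element and the transformation that was expected for it, rather than relying on a reviewer's judgment.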
Related papers
- Deep classification algorithm for De-identification of DICOM medical images [0.0]
De-identification of DICOM files is an essential component of medical image research. The most sensitive information, such as names, history, personal data, and institution, was successfully recognized.
arXiv Detail & Related papers (2025-08-04T08:21:18Z)
- DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction [0.0]
This paper presents a hybrid de-identification framework that combines rule-based and AI-driven techniques. Our solution addresses critical challenges in medical data de-identification and supports the secure, ethical, and trustworthy release of imaging data for research.
arXiv Detail & Related papers (2025-07-31T17:19:38Z)
- Medical Image De-Identification Benchmark Challenge [1.491270549044044]
The aim of the MIDI-B Challenge was to provide a standardized platform for benchmarking of DICOM image deID tools. The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. Ten teams successfully completed the test phase of the challenge.
arXiv Detail & Related papers (2025-07-31T14:47:20Z)
- UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs. We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
arXiv Detail & Related papers (2024-12-13T18:59:40Z)
- Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy [63.39037092484374]
Synthetic Data Generation (SDG) based on Artificial Intelligence (AI) can transform the way clinical medicine is delivered. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. The results show that TIDE-II generates clinically plausible, very realistic WCE images, of improved quality compared to relevant state-of-the-art generative models.
arXiv Detail & Related papers (2024-10-31T19:48:50Z)
- De-Identification of Medical Imaging Data: A Comprehensive Tool for Ensuring Patient Privacy [4.376648893167674]
Open-source tool can be used to de-identify DICOM magnetic resonance images, computer images, whole slide images and magnetic resonance twix raw data.
Proposal comprises an elaborate anonymization pipeline for multiple types of inputs, reducing the need for additional tools used for de-identification of imaging data.
arXiv Detail & Related papers (2024-10-16T09:31:24Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework (DeID-GPT).
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
- Report of the Medical Image De-Identification (MIDI) Task Group -- Best Practices and Recommendations [2.0719223149506028]
This report addresses the technical aspects of de-identification of medical images of human subjects and biospecimens. Only de-identification of publicly released data is considered. Alternative approaches to privacy, such as federated learning for artificial intelligence (AI) model development, are out of scope.
arXiv Detail & Related papers (2023-03-18T19:12:38Z)
- ConfounderGAN: Protecting Image Data Privacy with Causal Confounder [85.6757153033139]
We propose ConfounderGAN, a generative adversarial network (GAN) that can make personal image data unlearnable to protect the data privacy of its owners.
Experiments are conducted in six image classification datasets, consisting of three natural object datasets and three medical datasets.
arXiv Detail & Related papers (2022-12-04T08:49:14Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)