RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification
- URL: http://arxiv.org/abs/2505.18380v2
- Date: Thu, 24 Jul 2025 22:25:37 GMT
- Title: RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification
- Authors: Praphul Singh, Charlotte Dzialo, Jangwon Kim, Sumana Srivatsa, Irfan Bulu, Sri Gadde, Krishnaram Kenthapadi,
- Abstract summary: We propose a fully automated framework, RedactOR for de-identifying structured and unstructured electronic health records.<n>Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule and LLM based approaches.<n>We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities.
- Score: 10.378433440829712
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule and LLM based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key design desiderata, de-identification and relexicalization methodology, and modular architecture of RedactOR and its integration with the Oracle Health Clinical AI system. Evaluated on the i2b2 2014 De-ID dataset using standard metrics with strict recall, our approach achieves competitive performance while optimizing token usage to reduce LLM costs. Finally, we discuss key lessons and insights from deployment in real-world AI- driven healthcare data pipelines.
Related papers
- NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding [51.63264715941068]
textbfNEARL-CLIP (iunderlineNteracted quunderlineEry underlineAdaptation with ounderlineRthogonaunderlineL Regularization) is a novel cross-modality interaction VLM-based framework.
arXiv Detail & Related papers (2025-08-06T05:44:01Z) - Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework [0.0]
We propose a comprehensive framework for evaluating the quality of clinical data extracted by large language models (LLMs)<n>The framework integrates variable-level performance benchmarking against expert human abstraction, automated verification checks for internal consistency and plausibility, and replication analyses.<n>This multidimensional approach enables the identification of variables most in need of improvement, systematic detection of latent errors, and confirmation of dataset fitness-for-purpose in real-world research.
arXiv Detail & Related papers (2025-06-09T20:59:16Z) - TrialMatchAI: An End-to-End AI-powered Clinical Trial Recommendation System to Streamline Patient-to-Trial Matching [0.0]
We present TrialMatchAI, an AI-powered recommendation system that automates patient-to-trial matching.<n>Built on fine-tuned, open-source large language models, TrialMatchAI ensures transparency and maintains a lightweight deployment footprint.<n>In real-world validation, 92 percent of oncology patients had at least one relevant trial retrieved within the top 20 recommendations.
arXiv Detail & Related papers (2025-05-13T12:39:06Z) - Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems [39.23499993745249]
We introduce Semantic Integrity Constraints (SICs) to govern and optimize semantic operators within AI-augmented data processing systems.<n>SICs integrate seamlessly into the relational model, allowing users to specify common classes of constraints.<n>Our work establishes SICs as a foundational framework for trustworthy, high-performance AI-augmented data processing.
arXiv Detail & Related papers (2025-03-01T19:59:25Z) - New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding.<n>We introduce a new REC dataset with two key features. First, it is designed with controllable difficulty levels, requiring fine-grained reasoning across object categories, attributes, and relationships.<n>Second, it incorporates negative text and images generated through fine-grained editing, explicitly testing a model's ability to reject non-existent targets.
arXiv Detail & Related papers (2025-02-27T13:58:44Z) - LLMs for Generalizable Language-Conditioned Policy Learning under Minimal Data Requirements [50.544186914115045]
This paper presents TEDUO, a novel training pipeline for offline language-conditioned policy learning.<n>TEDUO operates on easy-to-obtain, unlabeled datasets and is suited for the so-called in-the-wild evaluation, wherein the agent encounters previously unseen goals and states.
arXiv Detail & Related papers (2024-12-09T18:43:56Z) - DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization [13.038800602897354]
We develop an adversarial approach using a large language model to re-identify the patient corresponding to a redacted clinical note.
Our method uses a large language model to reidentify the patient corresponding to a redacted clinical note.
Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
arXiv Detail & Related papers (2024-10-22T14:06:31Z) - DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data [6.473402241020136]
This work enhances the MASK framework by integrating ClinicalBERT, a deep learning model specifically fine-tuned on clinical texts.
The system effectively identifies and either redacts or replaces sensitive identifiable entities within clinical documents.
A risk assessment feature has also been developed, which analyses the uniqueness of context within documents to classify them into risk levels.
arXiv Detail & Related papers (2024-10-02T15:16:02Z) - CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to search for instances, which are semantically related to the query through the interaction of different modal data.<n>Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates.<n>We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - ACR: A Benchmark for Automatic Cohort Retrieval [1.3547712404175771]
Current cohort retrieval methods rely on automated queries of structured data combined with manual curation.
Recent advancements in large language models (LLMs) and information retrieval (IR) offer promising avenues to revolutionize these systems.
This paper introduces a new task, Automatic Cohort Retrieval (ACR), and evaluates the performance of LLMs and commercial, domain-specific neuro-symbolic approaches.
arXiv Detail & Related papers (2024-06-20T23:04:06Z) - CoRelation: Boosting Automatic ICD Coding Through Contextualized Code
Relation Learning [56.782963838838036]
We propose a novel approach, a contextualized and flexible framework, to enhance the learning of ICD code representations.
Our approach employs a dependent learning paradigm that considers the context of clinical notes in modeling all possible code relations.
arXiv Detail & Related papers (2024-02-24T03:25:28Z) - Adapting LLMs for Efficient, Personalized Information Retrieval: Methods
and Implications [0.7832189413179361]
Large Language Models (LLMs) excel in comprehending and generating human-like text.
This paper explores strategies for integrating Language Models (LLMs) with Information Retrieval (IR) systems.
arXiv Detail & Related papers (2023-11-21T02:01:01Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Federated Offline Reinforcement Learning [55.326673977320574]
We propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites.
We design the first federated policy optimization algorithm for offline RL with sample complexity.
We give a theoretical guarantee for the proposed algorithm, where the suboptimality for the learned policies is comparable to the rate as if data is not distributed.
arXiv Detail & Related papers (2022-06-11T18:03:26Z) - A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) are the de facto codes used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.