Related papers: Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs

URL: http://arxiv.org/abs/2407.05887v1
Date: Mon, 8 Jul 2024 12:47:03 GMT
Title: Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs
Authors: Sanjeet Singh, Shreya Gupta, Niralee Gupta, Naimish Sharma, Lokesh Srivastava, Vibhu Agarwal, Ashutosh Modi
Abstract summary: The average financial impact of a data breach in recent months has been estimated to be close to USD 10 million. Computer-based systems for de-identification of personal information are vulnerable to data drift.
Score: 3.8895618250348116
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The consequences of a healthcare data breach can be devastating for the patients, providers, and payers. The average financial impact of a data breach in recent months has been estimated to be close to USD 10 million. This is especially significant for healthcare organizations in India that are managing rapid digitization while still establishing data governance procedures that align with the letter and spirit of the law. Computer-based systems for de-identification of personal information are vulnerable to data drift, often rendering them ineffective in cross-institution settings. Therefore, a rigorous assessment of existing de-identification against local health datasets is imperative to support the safe adoption of digital health initiatives in India. Using a small set of de-identified patient discharge summaries provided by an Indian healthcare institution, in this paper, we report the nominal performance of de-identification algorithms (based on language models) trained on publicly available non-Indian datasets, pointing towards a lack of cross-institutional generalization. Similarly, experimentation with off-the-shelf de-identification systems reveals potential risks associated with the approach. To overcome data scarcity, we explore generating synthetic clinical reports (using publicly available and Indian summaries) by performing in-context learning over Large Language Models (LLMs). Our experiments demonstrate the use of generated reports as an effective strategy for creating high-performing de-identification systems with good generalization capabilities.

Related papers

An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing [1.2179548969182572]
Older adults, frequently hospitalized patients, and racial minorities are vulnerable to privacy attacks. We evaluate three anonymization methods ($k$-anonymity, the technique by Zheng et al., and the MO-OBAM model) based on their ability to reduce re-identification risk.
arXiv Detail & Related papers (2025-08-25T21:36:47Z)
Privacy-Aware, Public-Aligned: Embedding Risk Detection and Public Values into Scalable Clinical Text De-Identification for Trusted Research Environments [0.0]
We show how direct and indirect identifiers vary by record type, clinical setting, and data flow, and show how changes in documentation practice can degrade model performance over time. Our findings highlight that privacy risk is context-dependent and cumulative, underscoring the need for adaptable, hybrid de-identification approaches.
arXiv Detail & Related papers (2025-06-01T17:45:57Z)
Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank. It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z)
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage [77.83757117924995]
We propose a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information can be used to infer sensitive attributes like age or substance use history from sanitized data.
arXiv Detail & Related papers (2025-04-28T01:16:27Z)
Design and Implementation of a Scalable Clinical Data Warehouse for Resource-Constrained Healthcare Systems [0.0]
This study proposes a scalable, privacy-limited clinical data warehouse, NCDW, designed for heterogeneous EHR integration in resource-limited settings. The framework can be adapted to various healthcare settings across developing nations by modifying the ingestion layer to accommodate standards like ICD-11 and HL7 FHIR.
arXiv Detail & Related papers (2025-02-23T18:19:30Z)
MisinfoEval: Generative AI in the Era of "Alternative Facts" [50.069577397751175]
We introduce a framework for generating and evaluating large language model (LLM) based misinformation interventions. We present (1) an experiment with a simulated social media environment to measure effectiveness of misinformation interventions, and (2) a second experiment with personalized explanations tailored to the demographics and beliefs of users. Our findings confirm that LLM-based interventions are highly effective at correcting user behavior.
arXiv Detail & Related papers (2024-10-13T18:16:50Z)
Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling [6.193782515824411]
We present a system that generates synthetic free-text medical records using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk.
arXiv Detail & Related papers (2024-09-15T19:11:01Z)
Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems [1.8434042562191815]
The Internet of Medical Things (IoMT) transcends traditional medical boundaries, enabling a transition from reactive treatment to proactive prevention. Its benefits are countered by significant security challenges that endanger the lives of its users due to the sensitivity and value of the processed data. A new framework for Intrusion Detection Systems (IDS) is introduced, leveraging Artificial Neural Networks (ANN) for intrusion detection while utilizing Federated Learning (FL) for privacy preservation.
arXiv Detail & Related papers (2024-03-14T11:57:26Z)
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT-4-enabled de-identification framework (DeID-GPT). Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from unstructured medical text. This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z)
Epidemic Management and Control Through Risk-Dependent Individual Contact Interventions [1.1439420412899566]
Testing, contact tracing, and isolation (TTI) is an epidemic management and control approach that is difficult to implement at scale. Here we demonstrate a scalable improvement to TTI and exposure notification apps that uses data assimilation (DA) on a contact network.
arXiv Detail & Related papers (2021-09-22T18:39:10Z)
Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks. Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets. We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)
UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model. UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data. We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD). UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
Epidemic mitigation by statistical inference from contact tracing data [61.04165571425021]
We develop Bayesian inference methods to estimate the risk that an individual is infected. We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic. Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact.
arXiv Detail & Related papers (2020-09-20T12:24:45Z)
COVI White Paper [67.04578448931741]
Contact tracing is an essential tool to change the course of the Covid-19 pandemic. We present an overview of the rationale, design, ethical considerations and privacy strategy of COVI, a Covid-19 public peer-to-peer contact tracing and risk awareness mobile application developed in Canada.
arXiv Detail & Related papers (2020-05-18T07:40:49Z)
Approximate Nearest Neighbour Search on Privacy-aware Encoding of User Locations to Identify Susceptible Infections in Simulated Epidemics [13.55844312718721]
We investigate how effectively and efficiently a list of susceptible people can be found given a list of infected persons and their locations. By using the locations of the given list of infected persons as queries, we investigate the feasibility of applying approximate nearest neighbour (ANN) based indexing and retrieval approaches.
arXiv Detail & Related papers (2020-04-19T13:34:16Z)
