Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs
- URL: http://arxiv.org/abs/2407.05887v1
- Date: Mon, 8 Jul 2024 12:47:03 GMT
- Title: Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs
- Authors: Sanjeet Singh, Shreya Gupta, Niralee Gupta, Naimish Sharma, Lokesh Srivastava, Vibhu Agarwal, Ashutosh Modi,
- Abstract summary: Average financial impact of a data breach in recent months has been estimated to be close to USD 10 million.
Computer-based systems for de-identification of personal information are vulnerable to data drift.
- Score: 3.8895618250348116
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The consequences of a healthcare data breach can be devastating for the patients, providers, and payers. The average financial impact of a data breach in recent months has been estimated to be close to USD 10 million. This is especially significant for healthcare organizations in India that are managing rapid digitization while still establishing data governance procedures that align with the letter and spirit of the law. Computer-based systems for de-identification of personal information are vulnerable to data drift, often rendering them ineffective in cross-institution settings. Therefore, a rigorous assessment of existing de-identification against local health datasets is imperative to support the safe adoption of digital health initiatives in India. Using a small set of de-identified patient discharge summaries provided by an Indian healthcare institution, in this paper, we report the nominal performance of de-identification algorithms (based on language models) trained on publicly available non-Indian datasets, pointing towards a lack of cross-institutional generalization. Similarly, experimentation with off-the-shelf de-identification systems reveals potential risks associated with the approach. To overcome data scarcity, we explore generating synthetic clinical reports (using publicly available and Indian summaries) by performing in-context learning over Large Language Models (LLMs). Our experiments demonstrate the use of generated reports as an effective strategy for creating high-performing de-identification systems with good generalization capabilities.
Related papers
- MisinfoEval: Generative AI in the Era of "Alternative Facts" [50.069577397751175]
We introduce a framework for generating and evaluating large language model (LLM) based misinformation interventions.
We present (1) an experiment with a simulated social media environment to measure effectiveness of misinformation interventions, and (2) a second experiment with personalized explanations tailored to the demographics and beliefs of users.
Our findings confirm that LLM-based interventions are highly effective at correcting user behavior.
arXiv Detail & Related papers (2024-10-13T18:16:50Z) - Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling [6.193782515824411]
We present a system that generates synthetic free-text medical records using Masked Language Modeling (MLM)
Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk.
arXiv Detail & Related papers (2024-09-15T19:11:01Z) - Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems [1.8434042562191815]
The Internet of Medical Things (IoMT) transcends traditional medical boundaries, enabling a transition from reactive treatment to proactive prevention.
Its benefits are countered by significant security challenges that endanger the lives of its users due to the sensitivity and value of the processed data.
A new framework for Intrusion Detection Systems (IDS) is introduced, leveraging Artificial Neural Networks (ANN) for intrusion detection while utilizing Federated Learning (FL) for privacy preservation.
arXiv Detail & Related papers (2024-03-14T11:57:26Z) - DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework (DeID-GPT")
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z) - Epidemic Management and Control Through Risk-Dependent Individual
Contact Interventions [1.1439420412899566]
Testing, contact tracing, and isolation (TTI) is an epidemic management and control approach that is difficult to implement at scale.
Here we demonstrate a scalable improvement to TTI and exposure notification apps that uses data assimilation (DA) on a contact network.
arXiv Detail & Related papers (2021-09-22T18:39:10Z) - Clinical Outcome Prediction from Admission Notes using Self-Supervised
Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Epidemic mitigation by statistical inference from contact tracing data [61.04165571425021]
We develop Bayesian inference methods to estimate the risk that an individual is infected.
We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic.
Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact.
arXiv Detail & Related papers (2020-09-20T12:24:45Z) - COVI White Paper [67.04578448931741]
Contact tracing is an essential tool to change the course of the Covid-19 pandemic.
We present an overview of the rationale, design, ethical considerations and privacy strategy of COVI,' a Covid-19 public peer-to-peer contact tracing and risk awareness mobile application developed in Canada.
arXiv Detail & Related papers (2020-05-18T07:40:49Z) - Approximate Nearest Neighbour Search on Privacy-aware Encoding of User
Locations to Identify Susceptible Infections in Simulated Epidemics [13.55844312718721]
We investigate how effectively and efficiently can a list of susceptible people be found given a list of infected persons and their locations.
By using the locations of the given list of infected persons as queries, we investigate the feasibility of applying approximate nearest neighbour (ANN) based indexing and retrieval approaches.
arXiv Detail & Related papers (2020-04-19T13:34:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.