Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs
- URL: http://arxiv.org/abs/2407.05887v1
- Date: Mon, 8 Jul 2024 12:47:03 GMT
- Title: Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs
- Authors: Sanjeet Singh, Shreya Gupta, Niralee Gupta, Naimish Sharma, Lokesh Srivastava, Vibhu Agarwal, Ashutosh Modi
- Abstract summary: The average financial impact of a data breach in recent months has been estimated to be close to USD 10 million.
Computer-based systems for de-identification of personal information are vulnerable to data drift.
- Score: 3.8895618250348116
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The consequences of a healthcare data breach can be devastating for the patients, providers, and payers. The average financial impact of a data breach in recent months has been estimated to be close to USD 10 million. This is especially significant for healthcare organizations in India that are managing rapid digitization while still establishing data governance procedures that align with the letter and spirit of the law. Computer-based systems for de-identification of personal information are vulnerable to data drift, often rendering them ineffective in cross-institution settings. Therefore, a rigorous assessment of existing de-identification against local health datasets is imperative to support the safe adoption of digital health initiatives in India. Using a small set of de-identified patient discharge summaries provided by an Indian healthcare institution, in this paper, we report the nominal performance of de-identification algorithms (based on language models) trained on publicly available non-Indian datasets, pointing towards a lack of cross-institutional generalization. Similarly, experimentation with off-the-shelf de-identification systems reveals potential risks associated with the approach. To overcome data scarcity, we explore generating synthetic clinical reports (using publicly available and Indian summaries) by performing in-context learning over Large Language Models (LLMs). Our experiments demonstrate the use of generated reports as an effective strategy for creating high-performing de-identification systems with good generalization capabilities.
Related papers
- Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems [1.8434042562191815]
The Internet of Medical Things (IoMT) transcends traditional medical boundaries, enabling a transition from reactive treatment to proactive prevention.
Its benefits are countered by significant security challenges that endanger the lives of its users due to the sensitivity and value of the processed data.
A new framework for Intrusion Detection Systems (IDS) is introduced, leveraging Artificial Neural Networks (ANN) for intrusion detection while utilizing Federated Learning (FL) for privacy preservation.
arXiv Detail & Related papers (2024-03-14T11:57:26Z) - DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [80.36535668574804]
We develop a novel GPT4-enabled de-identification framework ("DeID-GPT").
Our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text.
This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification.
arXiv Detail & Related papers (2023-03-20T11:34:37Z) - Classifying Cyber-Risky Clinical Notes by Employing Natural Language Processing [9.77063694539068]
Some states within the United States of America now require that patients have open access to their clinical notes.
This research investigates methods for identifying security/privacy risks within clinical notes.
arXiv Detail & Related papers (2022-03-24T00:36:59Z) - Adherence Forecasting for Guided Internet-Delivered Cognitive Behavioral Therapy: A Minimally Data-Sensitive Approach [59.535699822923]
Internet-delivered psychological treatments (IDPT) are seen as an effective and scalable pathway to improving the accessibility of mental healthcare.
This work proposes a deep-learning approach to perform automatic adherence forecasting, while relying on minimally sensitive login/logout data.
The proposed Self-Attention Network achieved over 70% average balanced accuracy, when only 1/3 of the treatment duration had elapsed.
arXiv Detail & Related papers (2022-01-11T13:55:57Z) - Epidemic Management and Control Through Risk-Dependent Individual Contact Interventions [1.1439420412899566]
Testing, contact tracing, and isolation (TTI) is an epidemic management and control approach that is difficult to implement at scale.
Here we demonstrate a scalable improvement to TTI and exposure notification apps that uses data assimilation (DA) on a contact network.
arXiv Detail & Related papers (2021-09-22T18:39:10Z) - Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, outperforming the best state-of-the-art baseline by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Epidemic mitigation by statistical inference from contact tracing data [61.04165571425021]
We develop Bayesian inference methods to estimate the risk that an individual is infected.
We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic.
Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact.
arXiv Detail & Related papers (2020-09-20T12:24:45Z) - COVI White Paper [67.04578448931741]
Contact tracing is an essential tool to change the course of the Covid-19 pandemic.
We present an overview of the rationale, design, ethical considerations and privacy strategy of COVI, a Covid-19 public peer-to-peer contact tracing and risk awareness mobile application developed in Canada.
arXiv Detail & Related papers (2020-05-18T07:40:49Z) - Approximate Nearest Neighbour Search on Privacy-aware Encoding of User Locations to Identify Susceptible Infections in Simulated Epidemics [13.55844312718721]
We investigate how effectively and efficiently a list of susceptible people can be found given a list of infected persons and their locations.
By using the locations of the given list of infected persons as queries, we investigate the feasibility of applying approximate nearest neighbour (ANN) based indexing and retrieval approaches.
arXiv Detail & Related papers (2020-04-19T13:34:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.