Sensitive Data Detection with High-Throughput Machine Learning Models in
Electrical Health Records
- URL: http://arxiv.org/abs/2305.03169v2
- Date: Mon, 22 May 2023 00:07:33 GMT
- Title: Sensitive Data Detection with High-Throughput Machine Learning Models in
Electrical Health Records
- Authors: Kai Zhang and Xiaoqian Jiang
- Abstract summary: The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law designed to protect sensitive health information by defining regulations for protected health information (PHI).
One of the challenges in this area of research is the heterogeneous nature of PHI fields in data across different parties.
This variability makes rule-based sensitive variable identification systems that work on one database fail on another.
- Score: 15.982220037507169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of big data, there is an increasing need for healthcare providers,
communities, and researchers to share data and collaborate to improve health
outcomes, generate valuable insights, and advance research. The Health
Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law
designed to protect sensitive health information by defining regulations for
protected health information (PHI). However, it does not provide efficient
tools for detecting or removing PHI before data sharing. One of the challenges
in this area of research is the heterogeneous nature of PHI fields in data
across different parties. This variability makes rule-based sensitive variable
identification systems that work on one database fail on another. To address
this issue, our paper explores the use of machine learning algorithms to
identify sensitive variables in structured data, thus facilitating the
de-identification process. We made a key observation that the distributions of
metadata of PHI fields and non-PHI fields are very different. Based on this
novel finding, we engineered over 30 features from the metadata of the original
features and used machine learning to build classification models to
automatically identify PHI fields in structured Electronic Health Record (EHR)
data. We trained the model on a variety of large EHR databases from different
data sources and found that our algorithm achieves 99% accuracy when detecting
PHI-related fields for unseen datasets. The implications of our study are
significant and can benefit industries that handle sensitive data.
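The abstract describes engineering features from column metadata rather than from the meaning of individual rows. A minimal sketch of that idea follows, using only the Python standard library. The five features, the `NAME_HINTS` keyword list, and the function name `column_metadata_features` are illustrative assumptions; the paper's actual set of over 30 features is not listed on this page.

```python
from statistics import mean

# Hypothetical keyword list for column names; not taken from the paper.
NAME_HINTS = ("name", "ssn", "dob", "birth", "phone", "address", "email", "mrn", "zip")


def column_metadata_features(name, values):
    """Summarize one column by simple metadata statistics.

    These five features are assumptions chosen for illustration only.
    """
    # Treat None and blank strings as missing.
    present = [str(v) for v in values if v is not None and str(v).strip() != ""]
    n_total = len(values)
    n_present = len(present)
    chars = "".join(present)
    return {
        "missing_fraction": 1 - n_present / n_total if n_total else 0.0,
        "unique_ratio": len(set(present)) / n_present if n_present else 0.0,
        "mean_length": mean(len(v) for v in present) if present else 0.0,
        "digit_fraction": sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0,
        "name_hint": float(any(h in name.lower() for h in NAME_HINTS)),
    }


# Toy table with made-up values: one identifier-like column, one clinical flag.
table = {
    "patient_ssn": ["123-45-6789", "987-65-4321", "555-12-3456", None],
    "smoker": ["yes", "no", "no", "yes"],
}
features = {col: column_metadata_features(col, vals) for col, vals in table.items()}

for col, feats in features.items():
    print(col, feats)
```

In this toy table the identifier-like column has a high unique ratio and a high digit fraction, while the clinical flag has neither. Feature vectors of this kind, one per column, could then be passed to any standard classifier trained on columns labeled PHI or non-PHI.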
Related papers
- Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks [7.928574214440075]
This study examines integrating EHRs and NLP with large language models (LLMs) to improve healthcare data management and patient care.
It focuses on using advanced models to create secure, HIPAA-compliant synthetic patient notes for biomedical research.
arXiv Detail & Related papers (2024-07-23T04:20:14Z)
- An advanced data fabric architecture leveraging homomorphic encryption and federated learning [10.779491433438144]
This paper introduces a secure approach for medical image analysis using federated learning and partially homomorphic encryption within a distributed data fabric architecture.
The study demonstrates the method's effectiveness through a case study on pituitary tumor classification, achieving a significant level of accuracy.
arXiv Detail & Related papers (2024-02-15T08:50:36Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Diversity-enhancing Generative Network for Few-shot Hypothesis Adaptation [135.80439360370556]
We propose a diversity-enhancing generative network (DEG-Net) for the FHA problem.
It can generate diverse unlabeled data with the help of a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC).
arXiv Detail & Related papers (2023-07-12T06:29:02Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- When Accuracy Meets Privacy: Two-Stage Federated Transfer Learning Framework in Classification of Medical Images on Limited Data: A COVID-19 Case Study [77.34726150561087]
The COVID-19 pandemic has spread rapidly and caused a shortage of global medical resources.
CNNs have been widely utilized and verified in analyzing medical images.
arXiv Detail & Related papers (2022-03-24T02:09:41Z)
- Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record De-identification [6.026640792312181]
Federal law restricts the sharing of any EHR data that contains protected health information (PHI).
This project explores several deep learning-based named entity recognition (NER) methods to determine which method(s) perform better on the de-identification task.
We trained and tested our models on the i2b2 training dataset, and qualitatively assessed their performance using EHR data collected from a local hospital.
arXiv Detail & Related papers (2021-03-25T01:26:58Z)
- Handling Non-ignorably Missing Features in Electronic Health Records Data Using Importance-Weighted Autoencoders [8.518166245293703]
We propose a novel extension of VAEs called Importance-Weighted Autoencoders (IWAEs) to flexibly handle Missing Not At Random patterns in the Physionet data.
Our proposed method models the missingness mechanism using an embedded neural network, eliminating the need to specify the exact form of the missingness mechanism a priori.
arXiv Detail & Related papers (2021-01-18T22:53:29Z)
- Uncovering the structure of clinical EEG signals with self-supervised learning [64.4754948595556]
Supervised learning paradigms are often limited by the amount of labeled data that is available.
This phenomenon is particularly problematic in clinically-relevant data, such as electroencephalography (EEG).
By extracting information from unlabeled data, it might be possible to reach competitive performance with deep neural networks.
arXiv Detail & Related papers (2020-07-31T14:34:47Z)
- GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators [74.16405337436213]
We propose Gradient-sanitized Wasserstein Generative Adversarial Networks (GS-WGAN).
GS-WGAN allows releasing a sanitized form of sensitive data with rigorous privacy guarantees.
We find our approach consistently outperforms state-of-the-art approaches across multiple metrics.
arXiv Detail & Related papers (2020-06-15T10:01:01Z)
- Generation of Differentially Private Heterogeneous Electronic Health Records [9.926231893220061]
We explore using Generative Adversarial Networks to generate synthetic, heterogeneous EHRs.
We will explore applying differential privacy (DP) preserving optimization in order to produce DP synthetic EHR data sets.
arXiv Detail & Related papers (2020-06-05T13:21:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.