NLICE: Synthetic Medical Record Generation for Effective Primary
Healthcare Differential Diagnosis
- URL: http://arxiv.org/abs/2401.13756v1
- Date: Wed, 24 Jan 2024 19:17:45 GMT
- Title: NLICE: Synthetic Medical Record Generation for Effective Primary
Healthcare Differential Diagnosis
- Authors: Zaid Al-Ars, Obinna Agba, Zhuoran Guo, Christiaan Boerkamp, Ziyaad
Jaber, Tareq Jaber
- Abstract summary: We use a public disease-symptom data source called SymCat to construct the patient records.
To make the synthetic data more expressive, we use a medically-standardized symptom modeling method called NLICE.
We show results for the effectiveness of using the datasets to train predictive disease models.
- Score: 0.765458997723296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper offers a systematic method for creating medical knowledge-grounded
patient records for use in activities involving differential diagnosis.
It also assesses machine learning models that can differentiate
between various conditions based on given symptoms. We use a
public disease-symptom data source called SymCat in combination with Synthea to
construct the patient records. In order to increase the expressive nature of
the synthetic data, we use a medically-standardized symptom modeling method
called NLICE to augment the synthetic data with additional contextual
information for each condition. In addition, Naive Bayes and Random Forest
models are evaluated and compared on the synthetic data. The paper shows how to
successfully construct SymCat-based and NLICE-based datasets. We also show
results for the effectiveness of using the datasets to train predictive disease
models. Naive Bayes and Random Forest models trained on the SymCat-based
dataset reach Top-1 accuracies of 58.8% and 57.1%, respectively. In
contrast, the NLICE-based dataset improves the results, with a Top-1 accuracy
of 82.0% and Top-5 accuracy values of more than 90% for both models. Our
proposed data generation approach removes a major barrier to the application of
artificial intelligence methods in the healthcare domain. Our novel NLICE
symptom modeling approach addresses the incomplete and insufficient information
problem in the current binary symptom representation approach. The NLICE code
is open sourced at https://github.com/guozhuoran918/NLICE.
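The Top-1 and Top-5 figures quoted in the abstract are Top-k accuracies: a prediction counts as correct when the true condition is among the k highest-scoring candidates. The following is a minimal sketch of that metric, not code from the NLICE repository; the function name `top_k_accuracy` and the toy scores for four patients and three conditions are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the k best-scored classes.

    scores[i][c] is the model's score for class c on sample i (hypothetical
    input layout); labels[i] is the index of the true class for sample i.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for row, true_label in zip(scores, labels):
        # Rank class indices by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda c: -row[c])
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example (made-up scores): 4 patients, 3 candidate conditions.
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.5, 0.1, 0.4],
]
toy_labels = [0, 2, 2, 1]

for k in (1, 2, 3):
    print(f"Top-{k} accuracy: {top_k_accuracy(toy_scores, toy_labels, k):.2f}")
```

Because every Top-1 hit is also a Top-k hit for larger k, the metric can only stay the same or rise as k grows, which is why a model can have a modest Top-1 score and still exceed 90% at Top-5.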
Related papers
- Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks [0.7071166713283337]
We created datasets large enough to train machine learning models.
Our goal is to label behaviors corresponding to autism criteria.
Augmenting data increased recall by 13% but decreased precision by 16%.
arXiv Detail & Related papers (2024-05-08T03:18:12Z)
- Incorporating Improved Sinusoidal Threshold-based Semi-supervised Method
and Diffusion Models for Osteoporosis Diagnosis [0.43512163406552007]
Osteoporosis is a common skeletal disease that seriously affects patients' quality of life.
Traditional osteoporosis diagnosis methods are expensive and complex.
The proposed method automatically diagnoses osteoporosis from a patient's imaging data, with the advantages of convenience, accuracy, and low cost.
arXiv Detail & Related papers (2024-03-11T08:11:46Z)
- How Good Are Synthetic Medical Images? An Empirical Study with Lung
Ultrasound [0.3312417881789094]
Adding synthetic training data using generative models offers a low-cost method to deal with the data scarcity challenge.
We show that training with both synthetic and real data outperforms training with real data alone.
arXiv Detail & Related papers (2023-10-05T15:42:53Z)
- The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease
detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
- Large Language Models to Identify Social Determinants of Health in
Electronic Health Records [2.168737004368243]
Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health records (EHRs).
This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented.
800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated.
arXiv Detail & Related papers (2023-08-11T19:18:35Z)
- Exploring the Effectiveness of Dataset Synthesis: An application of
Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data slightly underperforms a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Improving the Level of Autism Discrimination through GraphRNN Link
Prediction [8.103074928419527]
This paper is based on the latter technique, which learns the edge distribution of real brain networks through GraphRNN.
The experimental results show that the combination of original and synthetic data greatly improves the discrimination of the neural network.
arXiv Detail & Related papers (2022-02-19T06:50:32Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records
Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.