Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
- URL: http://arxiv.org/abs/2509.10108v1
- Date: Fri, 12 Sep 2025 09:58:11 GMT
- Title: Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
- Authors: Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
- Abstract summary: We propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. We generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset.
- Score: 0.4666493857924357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using advanced generative AI systems (ChatGPT-4o and Gemini 2.5 Pro), we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of synthetic sources, we conducted an ablation study comparing ChatGPT-4o-generated and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.
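The abstract describes a pipeline of semantic filtering of synthetic question-answer pairs followed by BERTScore evaluation. The sketch below illustrates those two steps only; it is not the authors' released code, and the encoder model, the 0.75 similarity threshold, the multilingual BERT backbone, and the toy Arabic samples are all assumptions made for illustration.

```python
# Minimal sketch of semantic filtering + BERTScore evaluation, as described in the
# abstract. Model names, the similarity threshold, and the sample data are
# illustrative assumptions, not the paper's exact configuration.
from sentence_transformers import SentenceTransformer, util
from bert_score import score

# Hypothetical synthetic (question, answer) pairs and real reference answers in Arabic.
synthetic_pairs = [("ما هي أعراض فقر الدم؟", "تشمل الأعراض التعب وشحوب الجلد.")]
reference_answers = ["من أعراض فقر الدم الشائعة التعب والدوخة وشحوب الجلد."]

# 1) Semantic filtering: keep a synthetic sample only if its embedding is close
#    enough to at least one real sample (assumed multilingual encoder and threshold).
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
syn_emb = encoder.encode([q + " " + a for q, a in synthetic_pairs], convert_to_tensor=True)
ref_emb = encoder.encode(reference_answers, convert_to_tensor=True)
sims = util.cos_sim(syn_emb, ref_emb)  # pairwise cosine similarities
kept = [pair for pair, row in zip(synthetic_pairs, sims) if row.max() >= 0.75]

# 2) Evaluation: BERTScore F1 between candidate answers and references, using a
#    multilingual backbone since the text is Arabic.
candidates = [a for _, a in kept]
if candidates:
    P, R, F1 = score(candidates, reference_answers[: len(candidates)],
                     model_type="bert-base-multilingual-cased")
    print(f"mean BERTScore F1: {F1.mean().item():.3f}")
```

In the paper's setup, the filtered synthetic samples would then be merged with the 20,000 real records before fine-tuning; the snippet covers only the filtering and scoring steps.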
Related papers
- Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al.
We critically examine where exactly synthetic data improves model generalization.
Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
- Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data [14.791557943114737]
RoentGen-v2 is a text-to-image diffusion model for chest radiographs.
It generates clinically plausible images with demographic conditioning.
We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models.
arXiv Detail & Related papers (2025-08-22T20:30:58Z)
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.
Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- An Integrated Approach to AI-Generated Content in e-health [0.0]
We propose an end-to-end class-conditioned framework to generate synthetic medical images and text data.
Our framework integrates Diffusion and Large Language Models (LLMs) to generate data that closely match real-world patterns.
arXiv Detail & Related papers (2025-01-18T14:35:29Z)
- Merging synthetic and real embryo data for advanced AI predictions [69.07284335967019]
We train two generative models using two datasets (one we created and made publicly available, and one existing public dataset) to generate synthetic embryo images at various cell stages.
These were combined with real images to train classification models for embryo cell stage prediction.
Our results demonstrate that incorporating synthetic images alongside real data improved classification performance, with the model achieving 97% accuracy compared to 94.5% when trained solely on real data.
arXiv Detail & Related papers (2024-12-02T08:24:49Z)
- Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data? [8.775988650381397]
Training medical vision-language pre-training models requires datasets with paired, high-quality image-text data.
Recent advancements in Large Language Models (LLMs) and diffusion models have made it possible to generate large-scale synthetic image-text pairs.
We use off-the-shelf generative models to create synthetic radiology reports and paired Chest X-ray (CXR) images, and propose an automated pipeline to build a diverse, high-quality synthetic dataset.
arXiv Detail & Related papers (2024-10-17T13:11:07Z)
- MedSyn: LLM-based Synthetic Medical Text Generation Framework [0.27376226833693]
We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph.
We use the MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models.
Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data.
arXiv Detail & Related papers (2024-08-04T15:07:44Z)
- Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks [0.7071166713283337]
We created datasets large enough to train machine learning models.
Our goal is to label behaviors corresponding to autism criteria.
Augmenting data increased recall by 13% but decreased precision by 16%.
arXiv Detail & Related papers (2024-05-08T03:18:12Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction, empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)