DualAlign: Generating Clinically Grounded Synthetic Data
- URL: http://arxiv.org/abs/2509.10538v1
- Date: Fri, 05 Sep 2025 18:04:38 GMT
- Title: DualAlign: Generating Clinically Grounded Synthetic Data
- Authors: Rumeng Li, Xun Wang, Hong Yu
- Abstract summary: We introduce DualAlign, a framework that enhances statistical fidelity and clinical plausibility through dual alignment. Using Alzheimer's disease (AD) as a case study, DualAlign produces context-grounded symptom-level sentences. Fine-tuning an LLaMA 3.1-8B model with a combination of DualAlign-generated and human-annotated data yields substantial performance gains.
- Score: 9.87164447021602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic clinical data are increasingly important for advancing AI in healthcare, given strict privacy constraints on real-world EHRs, limited availability of annotated rare-condition data, and systemic biases in observational datasets. While large language models (LLMs) can generate fluent clinical text, producing synthetic data that is both realistic and clinically meaningful remains challenging. We introduce DualAlign, a framework that enhances statistical fidelity and clinical plausibility through dual alignment: (1) statistical alignment, which conditions generation on patient demographics and risk factors; and (2) semantic alignment, which incorporates real-world symptom trajectories to guide content generation. Using Alzheimer's disease (AD) as a case study, DualAlign produces context-grounded symptom-level sentences that better reflect real-world clinical documentation. Fine-tuning an LLaMA 3.1-8B model with a combination of DualAlign-generated and human-annotated data yields substantial performance gains over models trained on gold data alone or unguided synthetic baselines. While DualAlign does not fully capture longitudinal complexity, it offers a practical approach for generating clinically grounded, privacy-preserving synthetic data to support low-resource clinical text analysis.
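As a rough illustration of the dual-alignment idea, the sketch below composes a generation prompt that conditions on patient demographics and risk factors (statistical alignment) and on an observed symptom trajectory (semantic alignment). The function, field names, and wording are hypothetical, assumed for illustration; they are not the paper's actual pipeline.

```python
def build_dual_align_prompt(demographics, risk_factors, symptom_trajectory):
    """Compose a generation prompt conditioning on patient statistics
    (statistical alignment) and a real symptom sequence (semantic alignment)."""
    stat = ", ".join(f"{k}: {v}" for k, v in demographics.items())
    risks = "; ".join(risk_factors)
    traj = " -> ".join(symptom_trajectory)
    return (
        f"Patient profile ({stat}). Known risk factors: {risks}. "
        f"Observed symptom trajectory: {traj}. "
        "Write one clinical-note sentence documenting the next plausible symptom."
    )

prompt = build_dual_align_prompt(
    {"age": 74, "sex": "F"},
    ["APOE-e4 carrier", "hypertension"],
    ["mild forgetfulness", "word-finding difficulty"],
)
```

In this framing, sampling demographics and risk factors from real cohort distributions would drive statistical fidelity, while drawing trajectories from de-identified real notes would anchor clinical plausibility.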
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z) - Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality. The framework aims to learn complex relationships between clinical data and genetic predispositions. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z) - Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z) - SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering [1.5020330976600738]
We present SynLLM, a modular framework for generating high-quality synthetic medical data using open-source Large Language Models. We evaluate SynLLM on three public medical datasets: Diabetes, Cirrhosis, and Stroke. Our results show that prompt engineering significantly affects data quality and privacy risk, with rule-based prompts achieving the best privacy-quality balance.
arXiv Detail & Related papers (2025-08-11T23:56:42Z) - Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank. It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z) - How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues [14.457387337806765]
Synthetic data adoption in healthcare is driven by privacy concerns, data access limitations, and high annotation costs. We explore synthetic Prolonged Exposure (PE) therapy conversations for PTSD as a scalable alternative for training clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics like turn-taking and treatment fidelity.
arXiv Detail & Related papers (2025-04-30T16:56:56Z) - Generating Clinically Realistic EHR Data via a Hierarchy- and Semantics-Guided Transformer [0.0]
We propose the Hierarchy- and Semantics-Guided Transformer (HiSGT), a novel framework for generating clinically realistic EHR data. HiSGT constructs a hierarchical graph to encode parent-child and sibling relationships among clinical codes and employs a graph neural network to derive hierarchy-aware embeddings. Experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that HiSGT significantly improves the statistical alignment of synthetic data with real patient records.
arXiv Detail & Related papers (2025-02-28T05:06:04Z) - Synthetic Data Generation with LLM for Improved Depression Prediction [5.508617844957542]
We propose a pipeline for Large Language Models to generate synthetic data to improve the performance of depression prediction models.
Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, it also balanced the distribution of severity in the training dataset.
arXiv Detail & Related papers (2024-11-26T18:31:14Z) - Masked Clinical Modelling: A Framework for Synthetic and Augmented Survival Data Generation [1.7769033811751995]
We present Masked Clinical Modelling (MCM), a framework inspired by masked language modelling.
MCM is designed for both data synthesis and conditional data augmentation.
We evaluate this prototype on the WHAS500 dataset using Cox Proportional Hazards models.
arXiv Detail & Related papers (2024-10-22T08:38:46Z) - Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Synthesize High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model [40.473866438962034]
Synthetic electronic health records can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis.
We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHR.
arXiv Detail & Related papers (2023-04-04T23:53:34Z) - Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
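The positive-sampling idea in the last entry can be sketched as two randomly corrupted views of the same patient record forming a positive pair, whose representation similarity a contrastive loss would then maximize. The dropout-style augmentation below is an assumption chosen for illustration, not the paper's specific sampling strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, drop_prob=0.2):
    # Randomly zero out features so two views of one patient differ.
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

patient = rng.normal(size=16)           # toy EHR feature vector
view_a, view_b = augment(patient), augment(patient)
sim = cosine(view_a, view_b)            # positive-pair similarity to maximize
```

A contrastive objective would push `sim` toward 1 for such pairs while pushing views of different patients apart.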
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.