Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling
- URL: http://arxiv.org/abs/2510.22878v1
- Date: Mon, 27 Oct 2025 00:04:17 GMT
- Title: Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling
- Authors: Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
- Abstract summary: Foundation models refer to architectures trained on vast datasets using autoregressive pre-training to capture intricate patterns and motifs. We trained two autoregressive models -- a sequence-to-sequence LSTM and a reduced Transformer -- on longitudinal ART for HIV and Acute Hypotension datasets. Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete. Both reproduced feature distributions but failed to preserve cross-feature structure.
- Score: 0.7537475180985093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models refer to architectures trained on vast datasets using autoregressive pre-training from natural language processing to capture intricate patterns and motifs. They were originally developed to transfer such learned knowledge to downstream predictive tasks. Recently, however, some studies repurpose these learned representations for phenotype discovery without rigorous validation, risking superficially realistic but clinically incoherent embeddings. To test this mismatch, we trained two autoregressive models -- a sequence-to-sequence LSTM and a reduced Transformer -- on longitudinal ART for HIV and Acute Hypotension datasets. Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete. Patient-trajectory synthesis evaluated distributional and correlational fidelity. Both reproduced feature distributions but failed to preserve cross-feature structure -- showing that generative pre-training yields local realism but limited clinical coherence. These results highlight the need for domain-specific evaluation and support trajectory synthesis as a practical probe before fine-tuning or deployment.
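The probe the abstract describes (random inter-visit gaps at training time, then distributional and correlational fidelity checks on synthesized trajectories) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code; `inject_gaps` and `fidelity_report` are hypothetical names:

```python
import numpy as np

def inject_gaps(trajectory, drop_prob=0.3, rng=None):
    """Simulate irregular sampling: randomly drop visits from a
    (visits, features) array to create random inter-visit gaps."""
    rng = rng or np.random.default_rng()
    keep = rng.random(len(trajectory)) >= drop_prob
    keep[0] = True  # always retain the first visit
    return trajectory[keep]

def fidelity_report(real, synthetic):
    """Distributional check: largest per-feature mean gap.
    Correlational check: largest gap between the two feature
    correlation matrices (cross-feature structure)."""
    marginal_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
    corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                      - np.corrcoef(synthetic, rowvar=False)).max()
    return marginal_gap, corr_gap

rng = np.random.default_rng(0)
# Two strongly correlated features standing in for clinical variables.
real = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]],
                               size=2000)

# "Synthetic" data with intact marginals but scrambled cross-feature
# structure -- the failure mode the paper reports: shuffling one
# column preserves its distribution but destroys its correlation.
fake = real.copy()
rng.shuffle(fake[:, 1])

irregular = inject_gaps(real)  # shorter, gap-ridden training view
marginal_gap, corr_gap = fidelity_report(real, fake)
```

Here `marginal_gap` stays near zero (local realism) while `corr_gap` is large (lost clinical coherence), mirroring the paper's finding that matching feature distributions alone does not guarantee preserved cross-feature structure.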
Related papers
- The Stationarity Bias: Stratified Stress-Testing for Time-Series Imputation in Regulated Dynamical Systems [0.098314893665023]
Time-series imputation benchmarks use random masking and shape-agnostic metrics. We formalize this bias and propose a Stratified Stress-Test that partitions evaluation into Stationary and Transient regimes.
arXiv Detail & Related papers (2026-02-17T15:05:56Z) - Beyond Observations: Reconstruction Error-Guided Irregularly Sampled Time Series Representation Learning [38.869433924831156]
iTimER is a self-supervised framework for ISTS representation learning. It transforms unobserved timestamps into noise-aware training targets, enabling meaningful reconstruction signals. iTimER consistently outperforms state-of-the-art methods under the ISTS setting.
arXiv Detail & Related papers (2025-11-10T08:53:10Z) - rETF-semiSL: Semi-Supervised Learning for Neural Collapse in Temporal Data [44.17657834678967]
We propose a novel semi-supervised pre-training strategy to enforce latent representations that satisfy the Neural Collapse phenomenon. We show that our method significantly outperforms previous pretext tasks when applied to LSTMs, transformers, and state-space models.
arXiv Detail & Related papers (2025-08-13T19:16:47Z) - Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction [13.937881108738042]
We propose a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-05T07:01:05Z) - Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [57.19302613163439]
We introduce neural network reprogrammability as a unifying framework for model adaptation. We present a taxonomy that categorizes such information manipulation approaches across four key dimensions. We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z) - Memorization and Regularization in Generative Diffusion Models [5.128303432235475]
Diffusion models have emerged as a powerful framework for generative modeling. The analysis highlights the need for regularization to avoid reproducing the analytically tractable minimizer. Experiments are evaluated in the context of memorization, and directions for future development of regularization are highlighted.
arXiv Detail & Related papers (2025-01-27T05:17:06Z) - Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z) - From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real (FFR).
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z) - T-Phenotype: Discovering Phenotypes of Predictive Temporal Patterns in Disease Progression [82.85825388788567]
We develop a novel temporal clustering method, T-Phenotype, to discover phenotypes of predictive temporal patterns from labeled time-series data.
We show that T-Phenotype achieves the best phenotype discovery performance over all the evaluated baselines.
arXiv Detail & Related papers (2023-02-24T13:30:35Z) - Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly.
The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z) - Interpretable Additive Recurrent Neural Networks For Multivariate Clinical Time Series [4.125698836261585]
We present the Interpretable-RNN (I-RNN) that balances model complexity and accuracy by forcing the relationship between variables in the model to be additive.
I-RNN specifically captures the unique characteristics of clinical time series, which are unevenly sampled in time, asynchronously acquired, and have missing data.
We evaluate the I-RNN model on the Physionet 2012 Challenge dataset to predict in-hospital mortality, and on a real-world clinical decision support task: predicting hemodynamic interventions in the intensive care unit.
arXiv Detail & Related papers (2021-09-15T22:30:19Z) - Mode recovery in neural autoregressive sequence modeling [55.05526174291747]
Recent studies have revealed unexpected and undesirable properties of neural autoregressive sequence models.
We investigate how the modes, or local maxima, of a distribution are maintained throughout the full learning chain.
We conclude that future research must consider the entire learning chain in order to fully understand the potentials and perils.
arXiv Detail & Related papers (2021-06-10T02:17:28Z) - Data-driven generation of plausible tissue geometries for realistic photoacoustic image synthesis [53.65837038435433]
Photoacoustic tomography (PAT) has the potential to recover morphological and functional tissue properties.
We propose a novel approach to PAT data simulation, which we refer to as "learning to simulate".
We leverage the concept of Generative Adversarial Networks (GANs) trained on semantically annotated medical imaging data to generate plausible tissue geometries.
arXiv Detail & Related papers (2021-03-29T11:30:18Z) - Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.