One Loss to Rule Them All: Marked Time-to-Event for Structured EHR Foundation Models
- URL: http://arxiv.org/abs/2602.00541v1
- Date: Sat, 31 Jan 2026 06:15:46 GMT
- Title: One Loss to Rule Them All: Marked Time-to-Event for Structured EHR Foundation Models
- Authors: Zilin Jing, Vincent Jeanselme, Yuta Kobayashi, Simon A. Lee, Chao Pang, Aparajita Kashyap, Yanwei Li, Xinzhuo Jiang, Shalmali Joshi
- Abstract summary: We propose ORA, a marked time-to-event pretraining objective that jointly models event timing and associated measurements. Our results suggest a broader takeaway: pretraining objectives that account for EHR structure are critical for expanding downstream capabilities and generalizability.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clinical events captured in Electronic Health Records (EHR) are irregularly sampled and may consist of a mixture of discrete events and numerical measurements, such as laboratory values or treatment dosages. The sequential nature of EHR, analogous to natural language, has motivated the use of next-token prediction to train prior EHR Foundation Models (FMs) over events. However, this training fails to capture the full structure of EHR. We propose ORA, a marked time-to-event pretraining objective that jointly models event timing and associated measurements. Across multiple datasets, downstream tasks, and model architectures, this objective consistently yields more generalizable representations than next-token prediction and pretraining losses that ignore continuous measurements. Importantly, the proposed objective yields improvements beyond traditional classification evaluation, including better regression and time-to-event prediction. Beyond introducing a new family of FMs, our results suggest a broader takeaway: pretraining objectives that account for EHR structure are critical for expanding downstream capabilities and generalizability.
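
This listing gives only the high-level description of ORA. As a hedged sketch of what a marked time-to-event objective can look like, the PyTorch snippet below combines an exponential time-to-next-event likelihood, a categorical likelihood over event types, and a masked Gaussian likelihood for the associated numeric measurement. All names and the specific parametric choices here are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def marked_tte_loss(log_rate, type_logits, value_mean, value_logvar,
                    dt, event_type, value, value_mask):
    """Hypothetical marked time-to-event loss (not the paper's exact ORA objective).

    log_rate:     (B,) predicted log-intensity of the next event
    type_logits:  (B, K) logits over K discrete event types
    value_mean:   (B,) predicted mean of the numeric measurement
    value_logvar: (B,) predicted log-variance of the measurement
    dt:           (B,) observed time until the next event
    event_type:   (B,) observed event-type index
    value:        (B,) observed measurement (valid where value_mask == 1)
    value_mask:   (B,) 1 if the event carries a numeric measurement, else 0
    """
    # Exponential time-to-event NLL: -log p(dt) = -log_rate + rate * dt
    nll_time = -log_rate + torch.exp(log_rate) * dt
    # Categorical NLL over event types (the "mark")
    nll_type = F.cross_entropy(type_logits, event_type, reduction="none")
    # Gaussian NLL for the continuous measurement, masked out when absent
    nll_value = 0.5 * (value_logvar
                       + (value - value_mean) ** 2 / torch.exp(value_logvar))
    nll_value = nll_value * value_mask
    return (nll_time + nll_type + nll_value).mean()
```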
Related papers
- Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling [51.78972657142583]
We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K.
To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions.
arXiv Detail & Related papers (2026-03-05T04:13:57Z) - A Unified Frequency Domain Decomposition Framework for Interpretable and Robust Time Series Forecasting [81.73338008264115]
Current approaches for time series forecasting, whether in the time or frequency domain, predominantly use deep learning models based on linear layers or transformers.
We propose FIRE, a unified frequency domain decomposition framework that provides a mathematical abstraction for diverse types of time series.
FIRE consistently outperforms state-of-the-art models on long-term forecasting benchmarks.
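
FIRE's mathematical abstraction is not spelled out in this summary. The NumPy sketch below shows only the generic idea behind frequency-domain decomposition, splitting a series into band-limited components that sum back to the original signal; the function name and the equal-width band split are assumptions for illustration.

```python
import numpy as np

def frequency_decompose(x, k=4):
    """Generic frequency-domain decomposition (illustrative; not FIRE's
    actual formulation). Splits a 1-D series into k band-limited
    components whose sum reconstructs the original signal."""
    spec = np.fft.rfft(x)
    edges = np.linspace(0, len(spec), k + 1, dtype=int)
    comps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(spec)
        band[lo:hi] = spec[lo:hi]        # keep one frequency band
        comps.append(np.fft.irfft(band, n=len(x)))
    return comps  # sum(comps) recovers x up to float error
```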
arXiv Detail & Related papers (2025-10-11T09:59:25Z) - Building the EHR Foundation Model via Next Event Prediction [5.378917071184147]
Next Event Prediction (NEP) is a framework that enhances Large Language Models' temporal reasoning.
NEP explicitly models disease progression patterns and causal relationships.
Our analyses reveal dual benefits: state-of-the-art prediction accuracy combined with clinically interpretable attention patterns.
arXiv Detail & Related papers (2025-09-29T23:27:51Z) - Foundation Models for Clinical Records at Health System Scale [40.88151645546234]
We present a novel generative pretraining strategy for sequential EHR data using next-visit event prediction.
Our model learns to autoregressively generate various tokenized clinical events for the next visit based on patient history.
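
As a hedged illustration of next-visit event prediction under one plausible reading of this summary, the sketch below computes a teacher-forced cross-entropy over the tokens of the next visit, conditioned on the patient-history tokens. The `model` interface and tensor layout are assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def next_visit_nll(model, history_tokens, next_visit_tokens):
    """Illustrative next-visit objective: autoregressive cross-entropy
    over the next visit's tokens given the full history. `model` is
    assumed to map a (B, T) token tensor to (B, T, V) logits."""
    # Teacher forcing: feed history plus all but the last next-visit token.
    inputs = torch.cat([history_tokens, next_visit_tokens[:, :-1]], dim=1)
    logits = model(inputs)                               # (B, T, V)
    # Positions from the end of the history onward predict visit tokens.
    pred = logits[:, history_tokens.size(1) - 1:, :]     # (B, N, V)
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           next_visit_tokens.reshape(-1))
```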
arXiv Detail & Related papers (2025-07-01T08:52:33Z) - Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining.
We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
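
The Fréchet Inception Distance that CSI draws on fits a Gaussian to each set of embedded samples and compares the two fits. A minimal version of that computation is sketched below; CSI's actual featurization and any chemistry-specific adjustments are not given in this listing.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between two feature sets under a Gaussian fit,
    the quantity FID is built on. feats_a, feats_b: (N, D) arrays."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```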
arXiv Detail & Related papers (2025-02-16T11:46:23Z) - Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [68.94373533768501]
We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training.
We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z) - Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs [20.08838976147805]
We present the first systematic evaluation of the effect of context length on modeling EHR data.
We find that longer context models improve predictive performance.
For clinical applications, however, model performance alone is insufficient.
arXiv Detail & Related papers (2024-12-09T21:58:27Z) - Evidential time-to-event prediction with calibrated uncertainty quantification [12.446406577462069]
Time-to-event analysis provides insights into clinical prognosis and treatment recommendations.
We propose an evidential regression model specifically designed for time-to-event prediction.
We show that our model delivers both accurate and reliable performance, outperforming state-of-the-art methods.
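
The summary does not give the model's likelihood. As background, the sketch below implements the Normal-Inverse-Gamma negative log-likelihood from deep evidential regression (Amini et al., 2020), a standard starting point for evidential regression; the paper's calibrated time-to-event variant is not reproduced here.

```python
import math
import torch

def nig_nll(y, gamma, nu, alpha, beta):
    """Normal-Inverse-Gamma evidential regression NLL (Amini et al., 2020).

    gamma: predicted mean; nu > 0: virtual evidence for the mean;
    alpha > 1, beta > 0: Inverse-Gamma parameters for the variance.
    """
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * (math.log(math.pi) - torch.log(nu))
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log((y - gamma) ** 2 * nu + omega)
            + torch.lgamma(alpha)
            - torch.lgamma(alpha + 0.5))
```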
arXiv Detail & Related papers (2024-11-12T15:06:04Z) - Impact of Noisy Supervision in Foundation Model Learning [91.56591923244943]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z) - Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates how pre-training affects the subsequent fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z) - Towards Out-of-Distribution Sequential Event Prediction: A Causal Treatment [72.50906475214457]
The goal of sequential event prediction is to estimate the next event based on a sequence of historical events.
In practice, next-event prediction models are trained on sequential data collected at a single point in time, leaving them vulnerable to distribution shift.
We propose a framework with hierarchical branching structures for learning context-specific representations.
arXiv Detail & Related papers (2022-10-24T07:54:13Z) - Multi-axis Attentive Prediction for Sparse Event Data: An Application to Crime Prediction [16.654369376687296]
We present a purely attentional approach to extract both short-term dynamics and long-term semantics of event propagation through two observation angles.
The proposed contrastive learning objective significantly enhances MAPSED's ability to capture the semantics and dynamics of events.
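
The contrastive objective is only named in this summary. For orientation, the sketch below is a generic InfoNCE loss of the kind commonly used for such objectives, not MAPSED's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE contrastive loss over paired embeddings.
    z1, z2: (B, D) tensors where row i of z1 and z2 form a positive pair."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)        # match pairs on the diagonal
```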
arXiv Detail & Related papers (2021-10-05T02:38:46Z) - SANSformers: Self-Supervised Forecasting in Electronic Health Records with Attention-Free Models [48.07469930813923]
This work aims to forecast the demand for healthcare services by predicting the number of patient visits to healthcare facilities.
We introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data.
Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.
arXiv Detail & Related papers (2021-08-31T08:23:56Z) - Improving Event Duration Prediction via Time-aware Pre-training [90.74988936678723]
We introduce two effective models for duration prediction.
One model predicts the range/unit in which the duration value falls (R-pred); the other predicts the exact duration value (E-pred).
Our best model, E-pred, substantially outperforms previous work and captures duration information more accurately than R-pred.
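
As a minimal sketch of the two prediction styles this entry contrasts (hidden size, range count, and class name are assumptions), R-pred can be a classifier over duration ranges/units while E-pred regresses the exact value:

```python
import torch.nn as nn

class DurationHeads(nn.Module):
    """Hypothetical heads for the two duration-prediction styles."""
    def __init__(self, hidden=768, n_ranges=10):
        super().__init__()
        self.r_pred = nn.Linear(hidden, n_ranges)  # range/unit classifier
        self.e_pred = nn.Linear(hidden, 1)         # exact-value regressor

    def forward(self, h):
        return self.r_pred(h), self.e_pred(h).squeeze(-1)
```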
arXiv Detail & Related papers (2020-11-05T01:52:11Z)