Serialized EHR make for good text representations
- URL: http://arxiv.org/abs/2510.13843v1
- Date: Sat, 11 Oct 2025 17:16:15 GMT
- Title: Serialized EHR make for good text representations
- Authors: Zhirong Chou, Quan Qin, Shi Li
- Abstract summary: SerialBEHRT is a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship.
- Score: 1.585843510099207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state of the art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.
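The abstract does not spell out how structured records are turned into text, so the following is only a minimal sketch of what "temporal serialization" could look like. The `ClinicalEvent` schema, the `serialize_events` helper, the gap markers, and the sample events are all hypothetical illustrations, not SerialBEHRT's actual format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClinicalEvent:
    """One structured EHR row (hypothetical schema, for illustration only)."""
    day: date
    kind: str          # e.g. "diagnosis", "medication", "lab"
    description: str


def serialize_events(events: list[ClinicalEvent]) -> str:
    """Flatten structured events into one chronologically ordered string.

    Each event is prefixed with the number of days since the previous
    event, so the temporal gaps survive the conversion to plain text.
    """
    ordered = sorted(events, key=lambda e: e.day)  # stable: ties keep input order
    parts = []
    previous_day = None
    for event in ordered:
        if previous_day is None:
            gap = "day 0"
        else:
            gap = f"+{(event.day - previous_day).days} days"
        parts.append(f"[{gap}] {event.kind}: {event.description}")
        previous_day = event.day
    return " ; ".join(parts)


# Made-up example patient history, deliberately given out of order.
history = [
    ClinicalEvent(date(2024, 1, 10), "lab", "urine culture positive for E. coli"),
    ClinicalEvent(date(2024, 1, 3), "diagnosis", "urinary tract infection"),
    ClinicalEvent(date(2024, 1, 3), "medication", "nitrofurantoin"),
]
serialized = serialize_events(history)
print(serialized)
```

A string produced this way could then be passed to the tokenizer of a BERT-style model such as SciBERT. How events are ordered, how time gaps are encoded, and which fields are kept are design choices that the abstract leaves open.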
Related papers
- Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality. The framework aims to learn complex relationships between clinical data and genetic predispositions. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z)
- Building the EHR Foundation Model via Next Event Prediction [5.378917071184147]
Next Event Prediction (NEP) is a framework that enhances Large Language Models' temporal reasoning. NEP explicitly models disease progression patterns and causal relationships. Our analyses reveal dual benefits: state-of-the-art prediction accuracy combined with clinically interpretable attention patterns.
arXiv Detail & Related papers (2025-09-29T23:27:51Z)
- CEHR-XGPT: A Scalable Multi-Task Foundation Model for Electronic Health Records [9.583050730170557]
CEHR-XGPT is a general-purpose foundation model for EHR data. It unifies three essential capabilities: feature representation, zero-shot prediction, and synthetic data generation. It demonstrates strong performance across all three tasks and generalizes effectively to external datasets.
arXiv Detail & Related papers (2025-09-03T18:50:03Z)
- Time-Aware Attention for Enhanced Electronic Health Records Modeling [8.4225455796455]
TALE-EHR is a Transformer-based framework featuring a novel time-aware attention mechanism that explicitly models continuous temporal gaps. Our approach outperforms state-of-the-art baselines on tasks such as disease progression forecasting.
arXiv Detail & Related papers (2025-07-20T07:32:41Z)
- From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs [38.49879425944787]
We propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation. We introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models.
arXiv Detail & Related papers (2025-06-05T09:54:01Z)
- Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank. It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z)
- Temporal Entailment Pretraining for Clinical Language Models over EHR Data [9.584923572354045]
We introduce a novel temporal entailment pretraining objective for language models in the clinical domain. Our method formulates EHR segments as temporally ordered sentence pairs and trains the model to determine whether a later state is entailed by, contradictory to, or neutral with respect to an earlier state.
arXiv Detail & Related papers (2025-04-25T07:30:38Z)
- HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation [89.3260120072177]
We propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for radiology report generation. Our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression. Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models.
arXiv Detail & Related papers (2024-12-15T06:04:16Z)
- CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis [50.56875995511431]
We introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings.
arXiv Detail & Related papers (2024-11-01T15:54:07Z)
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z)
- TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
- SANSformers: Self-Supervised Forecasting in Electronic Health Records with Attention-Free Models [48.07469930813923]
This work aims to forecast the demand for healthcare services, by predicting the number of patient visits to healthcare facilities.
We introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data.
Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.
arXiv Detail & Related papers (2021-08-31T08:23:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.