Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
- URL: http://arxiv.org/abs/2603.01557v1
- Date: Mon, 02 Mar 2026 07:33:11 GMT
- Title: Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
- Authors: Aditya Shukla, Yining Yuan, Ben Tamo, Yifei Wang, Micky Nnamdi, Shaun Tan, Jieru Li, Benoit Marteau, Brad Willingham, May Wang
- Abstract summary: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. We introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset.
- Score: 6.415950855665798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations. The results reveal a striking decoupling between conventional metrics and clinical event fidelity. Models that achieve high semantic similarity scores often exhibit near-zero abnormality recall. In contrast, the vision-based approach demonstrates the strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall. These findings underscore the importance of event-aware evaluation to ensure reliable clinical time-series summarization.
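As a rough illustration of the event-based protocol described in the abstract, the sketch below derives daily events from threshold and persistence rules and then scores a summary against them. It is not the authors' implementation: the normal ranges, the two-day persistence rule, the `Event` type, the function names, and the exact metric definitions are all assumptions made for this example, and the step that extracts mentions from a generated summary is taken as given.

```python
from dataclasses import dataclass

# Hypothetical normal ranges; the abstract does not state the actual thresholds.
NORMAL_RANGES = {
    "heart_rate": (50.0, 100.0),
    "body_temperature": (36.0, 37.5),
}
# Assumed persistence criterion: an abnormality must last at least this many days.
MIN_PERSISTENCE_DAYS = 2


@dataclass(frozen=True)
class Event:
    measurement: str
    start_day: int
    duration_days: int


def derive_events(daily_values):
    """Return runs of consecutive abnormal days that meet the persistence rule."""
    events = []
    for name, values in daily_values.items():
        low, high = NORMAL_RANGES[name]
        run_start = None
        # A trailing None sentinel closes a run that reaches the last day.
        for day, value in enumerate(list(values) + [None]):
            abnormal = value is not None and not (low <= value <= high)
            if abnormal and run_start is None:
                run_start = day
            elif not abnormal and run_start is not None:
                length = day - run_start
                if length >= MIN_PERSISTENCE_DAYS:
                    events.append(Event(name, run_start, length))
                run_start = None
    return events


def score_summary(events, mentioned):
    """Score mentions extracted from a summary against the derived events.

    `mentioned` maps a measurement name to the abnormal duration (in days)
    that the summary claims. Matching by measurement name only is a
    simplification chosen for this sketch.
    """
    truth = {e.measurement for e in events}
    hits = [e for e in events if e.measurement in mentioned]
    duration_hits = [e for e in hits if mentioned[e.measurement] == e.duration_days]
    return {
        "abnormality_recall": len(hits) / len(events) if events else 0.0,
        "duration_recall": len(duration_hits) / len(hits) if hits else 0.0,
        "hallucinated": sorted(m for m in mentioned if m not in truth),
    }


daily_values = {
    "heart_rate": [72, 110, 115, 112, 80],
    "body_temperature": [36.6, 38.2, 36.7, 36.8, 36.5],
}
events = derive_events(daily_values)
scores = score_summary(events, {"heart_rate": 3, "body_temperature": 1})
```

In this toy input the single-day temperature spike fails the assumed persistence rule, so a summary that reports it as an abnormal event is counted as a hallucinated mention even though the value itself was out of range.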
Related papers
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z)
- Time-to-Event Transformer to Capture Timing Attention of Events in EHR Time Series [15.049813932448112]
LITT is a novel Timing-Transformer architecture that enables temporal alignment of sequential events on a virtual "relative timeline". Its interpretability and effectiveness are validated on real-world longitudinal EHR data from 3,276 breast cancer patients.
arXiv Detail & Related papers (2026-02-11T00:13:08Z)
- Mind the Missing: Variable-Aware Representation Learning for Irregular EHR Time Series using Large Language Models [0.6554326244334866]
VITAL is a variable-aware, large language model (LLM) based framework tailored for learning from irregularly sampled physiological time series. It reprograms vital signs into the language space, enabling the LLM to capture temporal context and reason over missing values. It maintains robust performance under high levels of missingness, which is prevalent in real-world clinical scenarios.
arXiv Detail & Related papers (2025-09-26T09:44:16Z)
- Evaluation of Stress Detection as Time Series Events -- A Novel Window-Based F1-Metric [3.0936815707071403]
Time series evaluation is essential for applications such as stress monitoring with wearable devices. Standard metrics like F1 often misrepresent model performance in real-world, imbalanced datasets. We introduce a window-based F1 metric (F1$_w$) that incorporates temporal tolerance.
arXiv Detail & Related papers (2025-09-03T11:55:28Z)
- A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports [10.869574822060553]
We present a system that transforms case reports into textual time series: structured pairs of textual events and timestamps. This work may serve as a benchmark for leveraging the PMOA corpus for temporal analytics.
arXiv Detail & Related papers (2025-04-15T20:54:19Z)
- ProMedTS: A Self-Supervised, Prompt-Guided Multimodal Approach for Integrating Medical Text and Time Series [27.70300880284899]
Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored. We introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify data types. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2025-02-19T07:56:48Z)
- CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis [50.56875995511431]
We introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings.
arXiv Detail & Related papers (2024-11-01T15:54:07Z)
- Deep State-Space Generative Model For Correlated Time-to-Event Predictions [54.3637600983898]
We propose a deep latent state-space generative model to capture the interactions among different types of correlated clinical events.
Our method also uncovers meaningful insights about the latent correlations among mortality and different types of organ failures.
arXiv Detail & Related papers (2024-07-28T02:42:36Z)
- CenTime: Event-Conditional Modelling of Censoring in Survival Analysis [49.44664144472712]
We introduce CenTime, a novel approach to survival analysis that directly estimates the time to event.
Our method features an innovative event-conditional censoring mechanism that performs robustly even when uncensored data is scarce.
Our results indicate that CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance.
arXiv Detail & Related papers (2023-09-07T17:07:33Z)
- Multi-view Integration Learning for Irregularly-sampled Clinical Time Series [1.9639092030562577]
We propose a multi-view features integration learning from irregular time series data by self-attention mechanism in an imputation-free manner.
We explicitly learn the relationships among the observed values, missing indicators, and time interval between the consecutive observations, simultaneously.
We build an attention-based decoder as a missing value imputer that helps empower the representation learning of the inter-relations among multi-view observations.
arXiv Detail & Related papers (2021-01-25T10:02:50Z)
- MIA-Prognosis: A Deep Learning Framework to Predict Therapy Response [58.0291320452122]
This paper aims at a unified deep learning approach to predict patient prognosis and therapy response.
We formalize the prognosis modeling as a multi-modal asynchronous time series classification task.
Our predictive model could further stratify low-risk and high-risk patients in terms of long-term survival.
arXiv Detail & Related papers (2020-10-08T15:30:17Z)
- Predicting Parkinson's Disease with Multimodal Irregularly Collected Longitudinal Smartphone Data [75.23250968928578]
Parkinson's Disease is a neurological disorder that is prevalent in elderly people.
Traditional ways to diagnose the disease rely on in-person subjective clinical evaluations on the quality of a set of activity tests.
We propose a novel time-series based approach to predicting Parkinson's Disease with raw activity test data collected by smartphones in the wild.
arXiv Detail & Related papers (2020-09-25T01:50:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.