Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning
- URL: http://arxiv.org/abs/2511.16333v1
- Date: Thu, 20 Nov 2025 13:11:45 GMT
- Title: Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning
- Authors: Mohammad Areeb Qazi, Maryam Nadeem, Mohammad Yaqub
- Abstract summary: World models learn predictive dynamics to enable multistep rollouts, counterfactual evaluation, and planning. Most reviewed systems achieve L1--L2, with fewer instances of L3 and rare L4 planning/control. This review outlines a research agenda for clinically robust, prediction-first world models.
- Score: 2.800335887892072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Healthcare requires AI that is predictive, reliable, and data-efficient. However, recent generative models lack the physical grounding and temporal reasoning required for clinical decision support. As scaling language models shows diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action-conditioned representations that reflect the physical and causal structure of care. This paper reviews world models for healthcare systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation, and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection-transition modeling, and Joint Embedding Predictive Architecture (JEPA)-style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action-conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action-conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1--L2, with fewer instances of L3 and rare L4. We identify cross-cutting gaps that limit clinical reliability: under-specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory-level uncertainty calibration. This review outlines a research agenda for clinically robust, prediction-first world models that integrate generative backbones (transformers, diffusion models, VAEs) with causal/mechanistic foundations for safe decision support in healthcare.
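The capability rubric in the abstract (L1 temporal prediction through L4 planning/control) rests on the idea of an action-conditioned transition model that can be rolled forward over multiple steps and compared across candidate interventions. A minimal sketch of that interface, with a hypothetical toy dynamics function standing in for a learned world model (all names here are illustrative, not from the paper):

```python
from typing import Callable, List

State = List[float]
Action = List[float]

# A learned world model would implement this signature:
# given the current state and an action, predict the next state.
TransitionFn = Callable[[State, Action], State]

def rollout(model: TransitionFn, state: State, actions: List[Action]) -> List[State]:
    """L1/L2 capability: action-conditioned multistep prediction.

    Returns the full trajectory, starting from the initial state.
    """
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory

def counterfactual(model: TransitionFn, state: State,
                   plan_a: List[Action], plan_b: List[Action]):
    """L3 capability: roll out two candidate treatment plans from the
    same initial state so their predicted outcomes can be compared."""
    return rollout(model, state, plan_a), rollout(model, state, plan_b)

# Toy additive dynamics standing in for a learned model.
def toy_model(state: State, action: Action) -> State:
    return [s + a for s, a in zip(state, action)]

traj = rollout(toy_model, [0.0], [[1.0], [2.0]])
# traj == [[0.0], [1.0], [3.0]]
```

L4 planning/control would then sit on top of this interface, searching over action sequences to optimize a predicted outcome; the rubric orders the capabilities by how much of this stack a system actually exercises.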
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - CLARITY: Medical World Model for Guiding Treatment Decisions by Modeling Context-Aware Disease Trajectories in Latent Space [49.74032713886216]
CLARITY is a medical world model that forecasts disease evolution directly within a structured latent space. It explicitly integrates time intervals (temporal context) and patient-specific data (clinical context) to model treatment-conditioned progression as a smooth, interpretable trajectory.
arXiv Detail & Related papers (2025-12-08T20:42:10Z) - OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction [2.904892426557913]
Large language models (LLMs) have shown strong performance in biomedical NLP. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling.
arXiv Detail & Related papers (2025-10-20T13:35:12Z) - SurgeryLSTM: A Time-Aware Neural Model for Accurate and Explainable Length of Stay Prediction After Spine Surgery [44.119171920037196]
We develop and evaluate machine learning (ML) models for predicting length of stay (LOS) in elective spine surgery. We compare traditional ML models with our developed model, SurgeryLSTM, a masked bidirectional long short-term memory (BiLSTM) with an attention mechanism. Performance was evaluated using the coefficient of determination (R2), and key predictors were identified using explainable AI.
arXiv Detail & Related papers (2025-07-15T01:18:28Z) - Keeping Medical AI Healthy: A Review of Detection and Correction Methods for System Degradation [6.781778751487079]
This review presents a forward-looking perspective on monitoring and maintaining the "health" of AI systems in healthcare. We highlight the urgent need for continuous performance monitoring, early degradation detection, and effective self-correction mechanisms. This work aims to guide the development of reliable, robust medical AI systems capable of sustaining safe, long-term deployment in dynamic clinical settings.
arXiv Detail & Related papers (2025-06-20T19:22:07Z) - Will Large Language Models Transform Clinical Prediction? [6.239284099493876]
Large language models (LLMs) are attracting increasing interest in healthcare. This commentary evaluates the potential of LLMs to improve clinical prediction models (CPMs) for diagnostic and prognostic tasks.
arXiv Detail & Related papers (2025-05-23T17:02:04Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages, examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Clairvoyance: A Pipeline Toolkit for Medical Time Series [95.22483029602921]
Time-series learning is the bread and butter of data-driven *clinical decision support*.
Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a software toolkit.
Clairvoyance is the first to demonstrate viability of a comprehensive and automatable pipeline for clinical time-series ML.
arXiv Detail & Related papers (2023-10-28T12:08:03Z) - Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - Safe AI for health and beyond -- Monitoring to transform a health service [51.8524501805308]
We will assess the infrastructure required to monitor the outputs of a machine learning algorithm.
We will present two scenarios with examples of monitoring and updates of models.
arXiv Detail & Related papers (2023-03-02T17:27:45Z) - Foresight -- Deep Generative Modelling of Patient Timelines using Electronic Health Records [46.024501445093755]
Temporal modelling of medical history can be used to forecast and simulate future events, estimate risk, suggest alternative diagnoses or forecast complications.
We present Foresight, a novel GPT3-based pipeline that uses NER+L tools (i.e. MedCAT) to convert document text into structured, coded concepts.
arXiv Detail & Related papers (2022-12-13T19:06:00Z) - Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.