Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
- URL: http://arxiv.org/abs/2510.10254v1
- Date: Sat, 11 Oct 2025 15:19:03 GMT
- Title: Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
- Authors: Yuxiang Lai, Jike Zhong, Ming Li, Yuheng Li, Xiaofeng Yang
- Abstract summary: We evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks. The model can delineate anatomical structures in CT scans and achieves competitive performance on segmentation, denoising, and super-resolution. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes.
- Score: 21.25724100313781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners, laying the groundwork for future medical foundation models built on video models.
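The motion-prediction setup described in the abstract treats the respiratory phases of a 4D CT scan as frames of a video and forecasts the next 3D volume from the prior phases. A minimal, hypothetical sketch of that evaluation loop follows; it uses simple linear extrapolation as a stand-in for the paper's learned autoregressive LVM, and synthetic volumes in place of real CT data:

```python
import numpy as np

def predict_next_phase(phases):
    """Toy stand-in for an autoregressive video model: predict the next
    3D phase by linearly extrapolating the last two phases. The paper's
    LVM is a learned autoregressive model; this is only illustrative."""
    assert len(phases) >= 2, "need at least two prior phases"
    return 2 * phases[-1] - phases[-2]

# Simulate one 4D CT scan: 10 respiratory phases of an 8x8x8 volume.
rng = np.random.default_rng(0)
base = rng.random((8, 8, 8))
# A smooth per-phase drift mimics breathing-induced anatomical motion.
phases = [base + 0.1 * t for t in range(10)]

# Forecast phase 10 from phases 1-9, then score against the held-out phase.
pred = predict_next_phase(phases[:9])
err = np.abs(pred - phases[9]).max()
```

Because the synthetic motion is exactly linear, the extrapolation here is exact; on real 4D CT, evaluation would instead use spatial-accuracy metrics against the ground-truth phase.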
Related papers
- CLARITY: Medical World Model for Guiding Treatment Decisions by Modeling Context-Aware Disease Trajectories in Latent Space [49.74032713886216]
CLARITY is a medical world model that forecasts disease evolution directly within a structured latent space. It explicitly integrates time intervals (temporal context) and patient-specific data (clinical context) to model treatment-conditioned progression as a smooth, interpretable trajectory.
arXiv Detail & Related papers (2025-12-08T20:42:10Z) - Glioblastoma Overall Survival Prediction With Vision Transformers [6.318465743962574]
Glioblastoma is one of the most aggressive and common brain tumors, with a median survival of 10-15 months. In this study, we propose a novel Artificial Intelligence (AI) approach for Overall Survival (OS) prediction using Magnetic Resonance Imaging (MRI) images. We exploit Vision Transformers (ViTs) to extract hidden features directly from MRI images, eliminating the need for tumor segmentation. The proposed model was evaluated on the BraTS dataset, reaching an accuracy of 62.5% on the test set, comparable to the top-performing methods.
arXiv Detail & Related papers (2025-08-04T13:59:57Z) - Towards a general-purpose foundation model for fMRI analysis [58.06455456423138]
We introduce NeuroSTORM, a framework that learns from 4D fMRI volumes and enables efficient knowledge transfer across diverse applications. NeuroSTORM is pre-trained on 28.65 million fMRI frames (>9,000 hours) from over 50,000 subjects across multiple centers and ages 5 to 100. It outperforms existing methods across five tasks: age/gender prediction, phenotype prediction, disease diagnosis, fMRI-to-image retrieval, and task-based fMRI.
arXiv Detail & Related papers (2025-06-11T23:51:01Z) - A versatile foundation model for cine cardiac magnetic resonance image analysis tasks [6.488550274514015]
We present a versatile foundation model that can perform a range of clinically relevant image analysis tasks. A multi-view convolution-transformer masked autoencoder, named CineMA, was trained on 15 million cine images from 74,916 subjects.
arXiv Detail & Related papers (2025-05-31T19:12:34Z) - Computationally Efficient Diffusion Models in Medical Imaging: A Comprehensive Review [8.314889727337198]
The diffusion model has emerged as a potent approach in computer vision, demonstrating remarkable performance in the field of generative artificial intelligence. We present the most recent advances in diffusion models by categorizing them into three key models: the Denoising Diffusion Probabilistic Model (DDPM), the Latent Diffusion Model (LDM), and the Wavelet Diffusion Model (WDM). These models play a crucial role in medical imaging, where producing fast, reliable, and high-quality medical images is essential for accurate analysis of abnormalities and disease diagnosis.
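For context on the DDPM category named above, the forward (noising) process corrupts a clean image x_0 into x_t in closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps drawn from a standard Gaussian. A self-contained sketch follows; the toy image and the linear beta schedule are common defaults for illustration, not values taken from the review:

```python
import numpy as np

def ddpm_forward(x0, t, alpha_bar, rng):
    """Sample x_t from the DDPM forward process at step t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # widely used linear schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 16))      # toy stand-in for a medical image
x_noisy = ddpm_forward(x0, T - 1, alpha_bar, rng)
```

By the final step the retained signal fraction alpha_bar[T-1] is nearly zero, so x_t is almost pure noise; a denoising network trained to reverse these steps is what generates images.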
arXiv Detail & Related papers (2025-05-09T07:56:04Z) - 3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography [5.65192078662102]
We introduce FM-CT: a Foundation Model for Head CT for generalizable disease detection, trained using self-supervised learning. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations. Our results demonstrate that the self-supervised foundation model significantly improves performance on downstream diagnostic tasks.
arXiv Detail & Related papers (2025-02-04T23:42:18Z) - Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis [55.959002385347645]
Latent Drifting enables diffusion models to be conditioned on medical images for the complex task of counterfactual image generation. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation.
arXiv Detail & Related papers (2024-12-30T01:59:34Z) - Exploring Foundation Models for Synthetic Medical Imaging: A Study on Chest X-Rays and Fine-Tuning Techniques [0.49000940389224884]
Machine learning has significantly advanced healthcare by aiding in disease prevention and treatment identification.
However, accessing patient data can be challenging due to privacy concerns and strict regulations.
Recent studies suggest that fine-tuning foundation models can produce such data effectively.
arXiv Detail & Related papers (2024-09-06T17:36:08Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697,000 radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Explainable multiple abnormality classification of chest CT volumes with AxialNet and HiResCAM [89.2175350956813]
We introduce the challenging new task of explainable multiple abnormality classification in volumetric medical images.
We propose a multiple instance learning convolutional neural network, AxialNet, that allows identification of top slices for each abnormality.
We then aim to improve the model's learning through a novel mask loss that leverages HiResCAM and 3D allowed regions.
arXiv Detail & Related papers (2021-11-24T01:14:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.