Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency
- URL: http://arxiv.org/abs/2512.14499v1
- Date: Tue, 16 Dec 2025 15:33:08 GMT
- Title: Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency
- Authors: Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng, Kaiwen Zhang, Yihua Sun, Chuhong Yang, Weihang Zhang, Fang Chen, Yilan Wu, Lie Ju, Guochen Ning, Longfei Ma, Huiping Yao, Jinyuan Wang, Peilun Shi, Yukun Zhou, Jie Xu, Pearse A. Keane, Hanruo Liu, Hongen Liao, Ningli Wang, Huiqi Li,
- Abstract summary: We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels.
- Score: 36.52215702000448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.
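The abstract describes zero-shot disease detection driven by the learned alignment between fundus photographs and diagnostic reports, but does not spell out the scoring mechanism. Below is a minimal sketch, assuming a CLIP-style setup in which disease-positive and disease-negative report prompts are embedded and compared against the image embedding; the encoder interfaces, function names, and prompt wording are hypothetical illustrations, not ReVision's actual API.

```python
# Minimal sketch of zero-shot disease detection via image-report alignment.
# Hypothetical: `text_encoder` and the prompt templates stand in for
# ReVision's (unspecified) text tower and report vocabulary.
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_disease_score(image_embedding, positive_texts, negative_texts, text_encoder):
    """Score one fundus image against textual descriptions of a disease.

    image_embedding: (d,) embedding from the (assumed) image encoder.
    positive_texts / negative_texts: report-style prompts for the two classes.
    text_encoder: callable mapping a list of strings to an (n, d) array.
    """
    img = l2_normalize(image_embedding)
    pos = l2_normalize(text_encoder(positive_texts))  # (n_pos, d)
    neg = l2_normalize(text_encoder(negative_texts))  # (n_neg, d)

    # Cosine similarity to each prompt, averaged per class.
    pos_sim = float(np.mean(pos @ img))
    neg_sim = float(np.mean(neg @ img))

    # Softmax over the two class similarities gives a probability-like score
    # that can be thresholded or fed into an AUROC computation.
    logits = np.array([neg_sim, pos_sim])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[1]  # probability assigned to the disease-positive prompts

# Toy usage with a stand-in encoder that returns random embeddings;
# in practice this would be the model's actual text encoder.
rng = np.random.default_rng(0)
def toy_text_encoder(texts, dim=256):
    return rng.normal(size=(len(texts), dim))

score = zero_shot_disease_score(
    image_embedding=rng.normal(size=256),
    positive_texts=["fundus photograph showing referable diabetic retinopathy"],
    negative_texts=["fundus photograph with no signs of diabetic retinopathy"],
    text_encoder=toy_text_encoder,
)
print(f"disease-positive probability: {score:.3f}")
```

In the same spirit, the "minimal adaptation" regime mentioned in the abstract would plausibly correspond to training only a small head (for example, a linear probe) on top of frozen embeddings, which is consistent with the claim of matching fine-tuned alternatives with orders of magnitude fewer trainable parameters.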
Related papers
- A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology [31.639593207459058]
We introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning. We evaluated it across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
arXiv Detail & Related papers (2026-02-11T08:14:20Z)
- Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis [7.945705180020063]
We propose a knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94%, compared to fine-tuned CLIP's 1.29%.
arXiv Detail & Related papers (2025-12-22T18:41:45Z)
- Timely Clinical Diagnosis through Active Test Selection [49.091903570068155]
We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design) to better emulate real-world diagnostic reasoning. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
arXiv Detail & Related papers (2025-10-21T18:10:45Z)
- Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z)
- DermINO: Hybrid Pretraining for a Versatile Dermatology Foundation Model [92.66916452260553]
DermNIO is a versatile foundation model for dermatology. It incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm. It consistently outperforms state-of-the-art models across a wide range of tasks.
arXiv Detail & Related papers (2025-08-17T00:41:39Z)
- Retinal Fundus Multi-Disease Image Classification using Hybrid CNN-Transformer-Ensemble Architectures [0.3277163122167434]
Our research is motivated by the urgent global issue of a large population affected by retinal diseases. Our primary objective is to develop a comprehensive diagnostic system capable of accurately predicting retinal diseases.
arXiv Detail & Related papers (2025-03-27T12:55:07Z)
- How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark [21.47220651857942]
Histopathology vision-language foundation models (VLMs) have gained popularity due to their enhanced performance and generalizability across different downstream tasks. Most existing histopathology benchmarks are either unimodal or limited in the diversity of clinical tasks, organs, and acquisition instruments, and are often only partially available to the public due to patient data privacy. We introduce HistoVL, a fully open-source, comprehensive benchmark comprising images acquired with up to 11 different acquisition tools, paired with captions that incorporate class names and diverse pathology descriptions.
arXiv Detail & Related papers (2025-03-17T09:45:22Z)
- Federated Learning for Diabetic Retinopathy Diagnosis: Enhancing Accuracy and Generalizability in Under-Resourced Regions [0.3277163122167433]
Diabetic retinopathy is the leading cause of vision loss in working-age adults worldwide, yet under-resourced regions lack ophthalmologists.
Current state-of-the-art deep learning systems struggle at these institutions due to limited generalizability.
This paper explores a novel federated learning system for diabetic retinopathy diagnosis with the EfficientNetB0 architecture.
arXiv Detail & Related papers (2024-10-30T23:56:56Z)
- Specialized curricula for training vision-language models in retinal image analysis [8.167708226285932]
Vision-language models (VLMs) automatically interpret images and summarize their findings as text. In this work, we demonstrate that OpenAI's ChatGPT-4o model markedly underperforms compared to practicing ophthalmologists on specialist tasks.
arXiv Detail & Related papers (2024-07-11T11:31:48Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
Training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
- Robust and Efficient Medical Imaging with Self-Supervision [80.62711706785834]
We present REMEDIS, a unified representation learning strategy to improve robustness and data-efficiency of medical imaging AI.
We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data.
arXiv Detail & Related papers (2022-05-19T17:34:18Z)