When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks
- URL: http://arxiv.org/abs/2511.22001v1
- Date: Thu, 27 Nov 2025 00:59:21 GMT
- Title: When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks
- Authors: David Isztl, Tahm Spitznagel, Gabor Mark Somfai, Rui Santos
- Abstract summary: We demonstrate that pretraining provides universal benefits (5.18-18.41% improvement), scaling with task difficulty. Compact general-purpose models deliver near-optimal performance for most retinal classification tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision foundation models have been widely adopted for retinal disease classification without systematic evidence justifying their parameter requirements. In the present work we address two critical questions: First, are large domain-specific foundation models essential, or do compact general-purpose architectures suffice? Second, does specialized retinal pretraining justify its computational cost? To answer these questions, we benchmark initialization strategies across four retinal imaging classification tasks spanning Optical Coherence Tomography (OCT) and Color Fundus Photography (CFP) modalities: 8-class OCT classification, 3-class diabetic macular edema (DME), 5-class diabetic retinopathy (DR), and 3-class glaucoma (GL) detection. We evaluate 12-13 model configurations per task, including vision transformers (22.8M-86.6M parameters), Swin Transformers (27.6M-28.3M), ConvNeXt (28.6M), and the domain-specific RETFound models (303M), under identical training conditions. Our results challenge prevailing assumptions: First, we demonstrate that pretraining provides universal benefits (5.18-18.41% improvement), scaling with task difficulty. Second, compact architectures (27-29M) dominate Pareto frontiers; SwinV2-tiny achieves top-1 performance on three datasets. Third, RETFound (303M) justifies its computational cost only for challenging DR grading (accuracy of 71.15%), while ImageNet pretraining proves to be sufficient for all other tasks (DME accuracy: 99.24%, OCT accuracy: 97.96%). CFP tasks show larger pretraining accuracy gains (9.13-18.41%) than OCT (5.18%). Thus, the evidence suggests that compact general-purpose models deliver near-optimal performance for most retinal classification tasks; specialized foundation models are warranted only for fine-grained discrimination under extreme class imbalance.
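The Pareto-frontier claim in the abstract can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the authors' code: it assumes each configuration is summarized by a parameter count and an accuracy, and the names and numbers in `configs` are made-up placeholders, not results from the paper. A configuration is on the frontier if no other configuration is at least as small and at least as accurate while being strictly better on one of the two.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    params_m: float  # parameter count in millions
    accuracy: float  # accuracy in percent


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.params_m <= b.params_m and a.accuracy >= b.accuracy
    strictly_better = a.params_m < b.params_m or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the non-dominated configurations, sorted by parameter count."""
    frontier = [c for c in configs if not any(dominates(o, c) for o in configs)]
    return sorted(frontier, key=lambda c: c.params_m)


# Hypothetical placeholder values for illustration only; NOT results from the paper.
configs = [
    Config("small_a", 22.0, 90.0),
    Config("compact_b", 28.0, 95.0),
    Config("base_c", 86.0, 94.0),
    Config("large_d", 300.0, 95.5),
]

frontier_names = [c.name for c in pareto_frontier(configs)]
print(frontier_names)
```

In this toy setup `base_c` is dropped because `compact_b` is both smaller and more accurate; a large model such as `large_d` stays on the frontier only if it buys accuracy that no smaller model reaches, which is the kind of cost question the abstract raises for RETFound.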
Related papers
- Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers [0.0]
Cell-level pathological image analysis requires working with extremely small image patches (40x40 pixels). We systematically evaluated architectural suitability and data-scale effects for small-patch cell classification.
arXiv Detail & Related papers (2026-03-04T13:52:19Z) - Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis [6.04562866374803]
We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. We present a benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters.
arXiv Detail & Related papers (2026-02-28T14:32:38Z) - Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models [0.30586855806896035]
Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains. We introduce MedCT-VLM: Medical CT Vision-Language Model, a parameter-efficient framework to adapt large-scale CT foundation models for downstream clinical tasks. We evaluate on zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training.
arXiv Detail & Related papers (2025-11-29T19:03:25Z) - Does DINOv3 Set a New Medical Vision Standard? [67.33543059306938]
This report investigates whether DINOv3 can serve as a powerful unified encoder for medical vision tasks without domain-specific pre-training. We benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks.
arXiv Detail & Related papers (2025-09-08T09:28:57Z) - Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging [3.7942449131350413]
We propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes. We evaluate Triad across three tasks, namely, organ/tumor segmentation, organ/cancer classification, and medical image registration.
arXiv Detail & Related papers (2025-02-19T19:31:52Z) - Brain Tumor Classification on MRI in Light of Molecular Markers [56.99710477905796]
Co-deletion of the 1p/19q gene is associated with clinical outcomes in low-grade gliomas. This study aims to utilize a specialized MRI-based convolutional neural network for brain cancer detection.
arXiv Detail & Related papers (2024-09-29T07:04:26Z) - Controllable retinal image synthesis using conditional StyleGAN and latent space manipulation for improved diagnosis and grading of diabetic retinopathy [0.0]
This paper proposes a framework for controllably generating high-fidelity and diverse DR fundus images.
We achieve comprehensive control over DR severity and visual features within generated images.
We manipulate the DR images generated conditionally on grades, further enhancing the dataset diversity.
arXiv Detail & Related papers (2024-09-11T17:08:28Z) - Video and Synthetic MRI Pre-training of 3D Vision Architectures for Neuroimage Analysis [3.208731414009847]
Transfer learning involves pre-training deep learning models on a large corpus of data for adaptation to specific tasks.
We benchmarked vision transformers (ViTs) and convolutional neural networks (CNNs) with varied upstream pre-training approaches.
The resulting pre-trained models can be adapted to a range of downstream tasks, even when training data for the target task is limited.
arXiv Detail & Related papers (2023-09-09T00:33:23Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - An Ensemble Method to Automatically Grade Diabetic Retinopathy with Optical Coherence Tomography Angiography Images [4.640835690336653]
We propose an ensemble method to automatically grade diabetic retinopathy (DR) images available from the Diabetic Retinopathy Analysis Challenge (DRAC) 2022.
First, we adopt state-of-the-art classification networks and train them to grade UW-OCTA images with different splits of the available dataset.
Ultimately, we obtain 25 models, of which, the top 16 models are selected and ensembled to generate the final predictions.
arXiv Detail & Related papers (2022-12-12T22:06:47Z) - Stacking Ensemble Learning in Deep Domain Adaptation for Ophthalmic Image Classification [61.656149405657246]
Domain adaptation is effective in image classification tasks where obtaining sufficient label data is challenging.
We propose a novel method, named SELDA, for stacking ensemble learning via extending three domain adaptation methods.
The experimental results using Age-Related Eye Disease Study (AREDS) benchmark ophthalmic dataset demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-09-27T14:19:00Z) - Cross-Site Severity Assessment of COVID-19 from CT Images via Domain Adaptation [64.59521853145368]
Early and accurate severity assessment of Coronavirus disease 2019 (COVID-19) based on computed tomography (CT) images greatly helps in estimating intensive care unit events.
To augment the labeled data and improve the generalization ability of the classification model, it is necessary to aggregate data from multiple sites.
This task faces several challenges including class imbalance between mild and severe infections, domain distribution discrepancy between sites, and presence of heterogeneous features.
arXiv Detail & Related papers (2021-09-08T07:56:51Z) - Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification [63.44396343014749]
We propose a new margin-based surrogate loss function for the AUC score.
It is more robust than the commonly used square loss while enjoying the same advantage in terms of large-scale optimization.
To the best of our knowledge, this is the first work that makes DAM succeed on large-scale medical image datasets.
arXiv Detail & Related papers (2020-12-06T03:41:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.