Vision Foundry: A System for Training Foundational Vision AI Models
- URL: http://arxiv.org/abs/2512.11837v1
- Date: Wed, 03 Dec 2025 14:02:22 GMT
- Title: Vision Foundry: A System for Training Foundational Vision AI Models
- Authors: Mahmut S. Gokmen, Mitchell A. Klusty, Evan W. Damron, W. Vaiden Logan, Aaron D. Mullen, Caroline N. Leach, Emily B. Collier, Samuel E. Armstrong, V. K. Cody Bumgardner
- Abstract summary: Vision Foundry is a code-free, HIPAA-compliant platform that democratizes pre-training, adaptation, and deployment of vision models. By bridging the gap between advanced representation learning and practical application, Vision Foundry enables domain experts to develop state-of-the-art clinical AI tools.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) leverages vast unannotated medical datasets, yet steep technical barriers limit adoption by clinical researchers. We introduce Vision Foundry, a code-free, HIPAA-compliant platform that democratizes pre-training, adaptation, and deployment of foundational vision models. The system integrates the DINO-MX framework, abstracting distributed infrastructure complexities while implementing specialized strategies like Magnification-Aware Distillation (MAD) and Parameter-Efficient Fine-Tuning (PEFT). We validate the platform across domains, including neuropathology segmentation, lung cellularity estimation, and coronary calcium scoring. Our experiments demonstrate that models trained via Vision Foundry significantly outperform generic baselines in segmentation fidelity and regression accuracy, while exhibiting robust zero-shot generalization across imaging protocols. By bridging the gap between advanced representation learning and practical application, Vision Foundry enables domain experts to develop state-of-the-art clinical AI tools with minimal annotation overhead, shifting focus from engineering optimization to clinical discovery.
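The abstract names the DINO-MX framework and Magnification-Aware Distillation without implementation detail. As orientation only, the sketch below shows the generic DINO-style self-distillation step such frameworks build on; the loss form, temperatures, EMA momentum, and toy linear encoders are standard choices assumed here, not Vision Foundry's actual code.

```python
# Minimal sketch of a DINO-style self-distillation step (PyTorch).
# Assumed, standard formulation; not Vision Foundry's or DINO-MX's code.
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened, centered teacher distribution
    and the student distribution; the teacher receives no gradient."""
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Teacher weights track an exponential moving average of the student."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# Toy usage with stand-in linear encoders (real systems use ViT backbones).
student = torch.nn.Linear(384, 256)
teacher = torch.nn.Linear(384, 256)
teacher.load_state_dict(student.state_dict())
center = torch.zeros(256)

views = torch.randn(8, 384)  # augmented crops of the same images
loss = dino_loss(student(views), teacher(views), center)
loss.backward()
ema_update(student, teacher)
center = 0.9 * center + 0.1 * teacher(views).mean(dim=0).detach()
```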
Related papers
- MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine [38.06252990946545]
We introduce MedGPT-oss, an open-weight, 20B-parameter vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MedGPT-oss pairs the GPT-oss language backbone with a visual front-end via an optimized, three-stage training curriculum (a minimal connector sketch appears after this list). It outperforms larger open medical models on out-of-distribution multimodal reasoning and complex text-only clinical tasks.
arXiv Detail & Related papers (2026-03-01T00:06:43Z)
- MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis [17.59077756990045]
MedEyes is a reinforcement learning framework that dynamically models clinician-style diagnostic reasoning. It emulates the diagnostic process through a dual-mode exploration strategy: scanning for systematic abnormality localization and drilling for detailed regional analysis. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5% across multiple medical VQA benchmarks.
arXiv Detail & Related papers (2025-11-27T01:47:43Z)
- Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z)
- DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision [9.254163621425727]
DiSSECT is a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck (a minimal quantization sketch appears after this list). It achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability.
arXiv Detail & Related papers (2025-09-23T07:58:21Z)
- Interpretable Clinical Classification with Kolmogorov-Arnold Networks [70.72819760172744]
Kolmogorov-Arnold Networks (KANs) offer intrinsic interpretability through transparent, symbolic representations (a toy KAN layer is sketched after this list). KANs support built-in patient-level insights, intuitive visualizations, and nearest-patient retrieval. These results position KANs as a promising step toward trustworthy AI that clinicians can understand, audit, and act upon.
arXiv Detail & Related papers (2025-09-20T17:21:58Z)
- A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications [77.3888788549565]
We present EchoCare, a novel ultrasound foundation model for generalist clinical use. We developed EchoCare via self-supervised learning on our curated, publicly available, large-scale dataset EchoCareData. With minimal training, EchoCare outperforms state-of-the-art comparison models across 10 representative ultrasound benchmarks.
arXiv Detail & Related papers (2025-09-15T10:05:31Z)
- Leveraging the Structure of Medical Data for Improved Representation Learning [12.175375511821352]
Building generalizable medical AI systems requires pretraining strategies that are data-efficient and domain-aware. We propose a self-supervised framework that leverages the inherent structure of medical datasets. We show strong performance compared to supervised objectives and to baselines trained without leveraging this structure.
arXiv Detail & Related papers (2025-07-01T11:14:45Z)
- Decentralized LoRA Augmented Transformer with Context-aware Multi-scale Feature Learning for Secured Eye Diagnosis [2.1358421658740214]
This paper proposes a novel Data-efficient Image Transformer (DeiT)-based framework that integrates context-aware multi-scale patch embedding, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning to address these challenges in a unified manner (a minimal LoRA sketch appears after this list). The proposed model effectively captures both local and global retinal features by leveraging multi-scale patch representations with local and global attention mechanisms.
arXiv Detail & Related papers (2025-05-11T13:51:56Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge (the global image-text contrastive term is sketched after this list).
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- CLIP in Medical Imaging: A Survey [59.429714742927956]
Contrastive Language-Image Pre-training (CLIP) successfully introduces text supervision to vision models. The use of CLIP has recently gained increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
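The MedGPT-oss entry above describes pairing a language backbone with a visual front-end. A common realization of such pairing, sketched below under that assumption, is a small projection module that maps frozen vision-encoder features into the language model's embedding space; the class name, dimensions, and token count are hypothetical, not taken from the paper.

```python
# Hedged sketch of a vision-language connector (not MedGPT-oss's design).
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim=1024, lm_dim=4096, num_tokens=64):
        super().__init__()
        # MLP that projects each visual feature into the LM embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.num_tokens = num_tokens

    def forward(self, vision_feats):
        # vision_feats: (batch, num_patches, vision_dim) from a frozen encoder.
        # Keep a fixed number of tokens and map them to "soft prompts" that
        # are prepended to the text embeddings before the language model.
        return self.proj(vision_feats[:, : self.num_tokens])

feats = torch.randn(2, 256, 1024)        # stand-in patch features
soft_tokens = VisionLanguageConnector()(feats)
print(soft_tokens.shape)                  # torch.Size([2, 64, 4096])
```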
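The DiSSECT entry describes a discrete representational bottleneck built from vector quantization. The sketch below is a minimal single-scale, VQ-VAE-style bottleneck with a straight-through estimator; DiSSECT's multi-scale design and training details are not reproduced, and the codebook size and feature dimension are illustrative.

```python
# Hedged sketch of a vector-quantization bottleneck (single scale only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQBottleneck(nn.Module):
    def __init__(self, codebook_size=512, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta

    def forward(self, z):
        # z: (batch, dim) continuous features from an encoder.
        d = torch.cdist(z, self.codebook.weight)   # distances to all codes
        idx = d.argmin(dim=-1)                     # nearest code per sample
        z_q = self.codebook(idx)
        # Codebook loss pulls codes to features; commitment loss does the reverse.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through: gradients flow to z as if quantization were identity.
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

z = torch.randn(8, 256, requires_grad=True)
z_q, idx, vq_loss = VQBottleneck()(z)
vq_loss.backward()
```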
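The KAN entry attributes interpretability to transparent functional representations. Below is a toy Kolmogorov-Arnold layer that puts a small learnable Fourier series on every input-output edge; the original KAN work uses B-spline bases, so this is a simplified stand-in rather than the paper's implementation.

```python
# Toy Kolmogorov-Arnold layer: one learnable univariate function per edge,
# built from a small Fourier basis (a simplification of spline-based KANs).
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_freq=4):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, num_freq + 1).float())
        # One coefficient per (output, input edge, frequency, sin/cos).
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_freq, 2) * 0.1)

    def forward(self, x):
        # x: (batch, in_dim). Evaluate each edge's univariate function, then
        # sum contributions over inputs, frequencies, and sin/cos terms.
        arg = x.unsqueeze(-1) * self.freqs                    # (b, i, f)
        basis = torch.stack([arg.sin(), arg.cos()], dim=-1)   # (b, i, f, 2)
        return torch.einsum("bifs,oifs->bo", basis, self.coef)

layer = ToyKANLayer(in_dim=8, out_dim=3)
print(layer(torch.randn(5, 8)).shape)   # torch.Size([5, 3])
```

Because each edge's function is univariate, it can be plotted directly, which is the source of the symbolic, auditable interpretability the summary mentions.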
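Both the Vision Foundry abstract (PEFT) and the DeiT entry above rely on Low-Rank Adaptation. A minimal LoRA wrapper for a single linear layer looks roughly like the following; the rank, scaling, and initialization are conventional choices, not values from either paper.

```python
# Hedged sketch of LoRA: freeze a pretrained linear layer and learn a
# low-rank additive update. Only A and B are trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)        # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank          # update starts at zero since B = 0

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

pretrained = nn.Linear(768, 768)           # stand-in for a ViT projection
adapted = LoRALinear(pretrained)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(trainable)                            # only the low-rank factors train
```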
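The MLIP and CLIP entries both build on image-text contrastive learning. The sketch below is the standard symmetric InfoNCE objective behind CLIP-style pretraining; MLIP's divergence encoder and knowledge-guided terms are additions on top of this component and are not reproduced here.

```python
# Symmetric InfoNCE objective for CLIP-style image-text pretraining.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Matched image-text pairs sit on the diagonal of the similarity matrix;
    each row/column is a softmax classification over the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (batch, batch)
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```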