MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
- URL: http://arxiv.org/abs/2511.20650v1
- Date: Tue, 25 Nov 2025 18:59:53 GMT
- Title: MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
- Authors: Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal
- Abstract summary: We introduce MedROV, the first real-time open-vocabulary detection model for medical imaging. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures.
- Score: 89.81463562506637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects with novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first real-time open-vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
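To make the open-vocabulary mechanism concrete, here is a minimal sketch (not MedROV's actual code) of how such detectors typically classify region proposals: pooled region features are scored against text embeddings of class prompts, so a novel label only requires encoding a new prompt rather than retraining a classification head. The function name, feature dimensions, and temperature are illustrative.

```python
# Illustrative only: open-vocabulary scoring of region proposals against
# text prompts via cosine similarity, the standard OVOD recipe.
import torch
import torch.nn.functional as F

def open_vocab_scores(region_feats: torch.Tensor,
                      text_embeds: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """region_feats: (N, D) pooled features for N proposals.
    text_embeds: (C, D) encoded prompts, e.g. "a CT scan of a liver lesion".
    Returns (N, C) class probabilities."""
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_feats @ text_embeds.t() / temperature
    return logits.softmax(dim=-1)

# Toy usage: 5 proposals scored against 3 labels (two seen, one novel).
probs = open_vocab_scores(torch.randn(5, 256), torch.randn(3, 256))
print(probs.shape)  # torch.Size([5, 3])
```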
Related papers
- MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging [6.520674045578402]
We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors.
On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art UniGradICON model.
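As a rough illustration of descriptor-based correspondence (the descriptor volumes below are random stand-ins for the multi-scale diffusion features MedDIFT extracts), a query voxel can be matched to the target volume by cosine similarity:

```python
# Illustrative voxel matching: nearest neighbor in descriptor space.
import torch
import torch.nn.functional as F

def match_voxel(src_desc: torch.Tensor, tgt_desc: torch.Tensor, xyz):
    """src_desc, tgt_desc: (D, X, Y, Z) descriptor volumes; xyz: source voxel."""
    D, X, Y, Z = tgt_desc.shape
    q = F.normalize(src_desc[:, xyz[0], xyz[1], xyz[2]], dim=0)  # (D,) query
    t = F.normalize(tgt_desc.reshape(D, -1), dim=0)              # (D, X*Y*Z)
    idx = (q @ t).argmax().item()                                # best match
    return idx // (Y * Z), (idx // Z) % Y, idx % Z               # unravel to xyz

src = torch.randn(32, 8, 8, 8)  # stand-in for diffusion-feature volumes
tgt = torch.randn(32, 8, 8, 8)
print(match_voxel(src, tgt, (2, 3, 4)))
```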
arXiv Detail & Related papers (2025-12-05T09:53:07Z)
- Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease [3.46857682956989]
Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering.
However, most existing models are trained from scratch or fine-tuned on large-scale 2D image-text pairs.
We propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs to 3D MRI.
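A minimal sketch of the generic data-efficient recipe, assuming the common pattern of freezing a pretrained encoder and training only a small head; the paper's actual pipeline for adapting CT-based Med-VLMs to MRI is more involved, and the backbone here is a toy stand-in.

```python
# Illustrative parameter-efficient adaptation: freeze the encoder,
# train only a small linear head on the scarce target-domain data.
import torch
import torch.nn as nn

class FrozenBackboneClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # pretrained weights stay fixed
        self.head = nn.Linear(feat_dim, n_classes)  # only these weights train

    def forward(self, x):
        return self.head(self.backbone(x))

backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64))  # toy encoder
model = FrozenBackboneClassifier(backbone, feat_dim=64, n_classes=2)
print(model(torch.randn(4, 1, 32, 32)).shape)  # torch.Size([4, 2])
```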
arXiv Detail & Related papers (2025-09-09T11:36:21Z)
- MedFuncta: A Unified Framework for Learning Efficient Medical Neural Fields [17.156760213520055]
We introduce MedFuncta, a unified framework for large-scale NF training on diverse medical signals.
Our approach encodes data into a unified representation, namely a 1D latent vector, that modulates a shared, meta-learned NF.
We release our code, model weights, and the first large-scale dataset, MedNF, containing over 500K latent vectors for multi-instance medical NFs.
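A minimal sketch of a shift-modulated neural field in the functa style this builds on: one shared coordinate MLP whose hidden activations are shifted by a per-signal 1D latent. Layer sizes and the sine activation are illustrative, not MedFuncta's exact configuration.

```python
# Illustrative functa-style field: shared MLP + per-signal latent shifts.
import torch
import torch.nn as nn

class ModulatedField(nn.Module):
    def __init__(self, coord_dim=3, hidden=64, depth=3, latent_dim=128, out_dim=1):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(coord_dim, hidden)] +
            [nn.Linear(hidden, hidden) for _ in range(depth - 1)])
        self.shifts = nn.Linear(latent_dim, depth * hidden)  # latent -> modulations
        self.out = nn.Linear(hidden, out_dim)
        self.hidden = hidden

    def forward(self, coords, latent):
        # coords: (N, coord_dim); latent: (latent_dim,) identifies one signal.
        s = self.shifts(latent).view(len(self.layers), self.hidden)
        h = coords
        for layer, shift in zip(self.layers, s):
            h = torch.sin(layer(h) + shift)  # SIREN-style sine activation
        return self.out(h)

field = ModulatedField()
print(field(torch.rand(1024, 3), torch.randn(128)).shape)  # torch.Size([1024, 1])
```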
arXiv Detail & Related papers (2025-02-20T09:38:13Z)
- UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks.
UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs.
We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
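The contrastive objective behind CLIP-style pretraining can be sketched with a symmetric InfoNCE loss; this is a simplification of the real training setup, with batch size, dimensions, and temperature chosen only for illustration.

```python
# Minimal CLIP-style symmetric contrastive loss over matched pairs.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of matched image-text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (B, B) pairwise similarities
    labels = torch.arange(len(logits))     # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```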
arXiv Detail & Related papers (2024-12-13T18:59:40Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect aligned medical image-text data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
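For intuition on the subfigure-to-subcaption problem: compound PubMed figures carry one caption covering several panels. A naive baseline (not the paper's method) splits the caption on panel markers and pairs the pieces with detected panels in reading order; the caption text here is invented for illustration.

```python
# Naive baseline: split "(a) ... (b) ..." captions into per-panel pieces.
import re

def split_subcaptions(caption: str) -> dict:
    """Return {panel_letter: subcaption_text} from a compound caption."""
    parts = re.split(r"\(([a-z])\)\s*", caption)  # ['', 'a', text_a, 'b', ...]
    return {parts[i]: parts[i + 1].strip(" .;") for i in range(1, len(parts) - 1, 2)}

caption = "(a) Axial T1 MRI. (b) Segmented lesion overlay. (c) 3D rendering."
print(split_subcaptions(caption))
# {'a': 'Axial T1 MRI', 'b': 'Segmented lesion overlay', 'c': '3D rendering'}
```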
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection.
Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels.
Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
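A sketch of the kind of residual bottleneck adapter such frameworks insert at several levels of a frozen encoder; the dimensions and scaling factor are illustrative, not the paper's exact design.

```python
# Illustrative residual adapter: out = x + scale * MLP(x); only the
# adapter trains, while the encoder producing x stays frozen.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x):
        return x + self.scale * self.up(torch.relu(self.down(x)))

adapter = ResidualAdapter()
feats = adapter(torch.randn(4, 197, 768))  # e.g. ViT tokens from one level
print(feats.shape)  # torch.Size([4, 197, 768])
```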
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
- Building RadiologyNET: Unsupervised annotation of a large-scale multimodal medical database [0.4915744683251151]
The use of machine learning in medical diagnosis and treatment has grown significantly in recent years.
However, the availability of large annotated image datasets remains a major obstacle, since annotation is time-consuming and costly.
This paper explores how to automatically annotate a database of medical radiology images based on their semantic similarity.
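One generic way to realize annotation by semantic similarity (not necessarily RadiologyNET's exact procedure) is to embed every image with a pretrained encoder and cluster the embeddings, treating cluster ids as machine-generated labels:

```python
# Illustrative unsupervised annotation: cluster image embeddings and use
# the cluster index as a pseudo-label. Embeddings here are random stand-ins.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))   # stand-in for encoder features
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
print(np.bincount(labels))                  # images per pseudo-label cluster
```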
arXiv Detail & Related papers (2023-07-27T13:00:33Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
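Full second-order (quadratic) graph matching is expensive to sketch faithfully; the toy below instead matches two feature graphs on first-order node affinity alone and then measures how well pairwise, second-order structure is preserved, just to make the objective's two terms concrete.

```python
# Crude illustration of first- vs second-order graph matching terms.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
f1 = rng.random((16, 128))   # node features from augmented view 1
f2 = rng.random((16, 128))   # node features from augmented view 2

# First-order matching: assign nodes by feature (cosine) affinity alone.
_, perm = linear_sum_assignment(cdist(f1, f2, "cosine"))

# Second-order check: matched nodes should preserve pairwise geometry.
d1, d2 = cdist(f1, f1), cdist(f2[perm], f2[perm])
print("mean edge distortion:", np.abs(d1 - d2).mean())
```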
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
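Text-to-image retrieval in a shared embedding space reduces to ranking image embeddings by similarity to each report embedding; Recall@K then counts how often the paired image lands in the top K. A minimal sketch with random embeddings standing in for encoder outputs:

```python
# Illustrative Recall@K evaluation for text-to-image retrieval.
import torch
import torch.nn.functional as F

def recall_at_k(txt_emb, img_emb, k=5):
    """txt_emb, img_emb: (N, D); row i of each is a matched report/image pair."""
    sims = F.normalize(txt_emb, dim=-1) @ F.normalize(img_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                       # (N, k) rankings
    hits = (topk == torch.arange(len(sims)).unsqueeze(1)).any(dim=-1)
    return hits.float().mean().item()

print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=5))
```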
arXiv Detail & Related papers (2023-03-30T18:20:00Z)