MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
- URL: http://arxiv.org/abs/2602.06965v1
- Date: Fri, 06 Feb 2026 18:59:59 GMT
- Title: MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
- Authors: Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, Imran Razzak
- Abstract summary: We introduce MedMO, a medical foundation model built upon a generalized MLLM architecture. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy.
- Score: 25.29568841502814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO's broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. The project page is available at https://genmilab.github.io/MedMO-Page
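The box-level GIoU reward in the RL stage is a standard, verifiable grounding signal. As a reference point, below is a minimal sketch of a GIoU-based reward for axis-aligned boxes; the function names, the (x1, y1, x2, y2) box format, and the mapping of GIoU to a [0, 1] reward are illustrative assumptions, not details taken from the paper.

```python
def giou(box_a, box_b):
    """Generalized IoU for non-degenerate axis-aligned boxes
    given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Smallest enclosing box; the GIoU term penalizes predictions that
    # sit far from the target even when the plain IoU is zero.
    enclosing = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return inter / union - (enclosing - union) / enclosing


def box_reward(pred_box, gt_box):
    # GIoU lies in [-1, 1]; shift it to [0, 1] so it can be combined
    # with a binary factuality reward (an illustrative choice).
    return (giou(pred_box, gt_box) + 1.0) / 2.0
```

For instance, for boxes (0, 0, 2, 2) and (1, 1, 3, 3), the IoU is 1/7, while the GIoU drops to 1/7 - 2/9 ≈ -0.08, reflecting the enclosing-box penalty that gives the reward a useful gradient even for poorly overlapping predictions.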
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. However, most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. We introduce a model-agnostic, parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z)
- MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding [47.843626983298726]
We introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources, spanning video-, segment-, and frame-level tasks. While supervised fine-tuning on MedVidBench yields noticeable gains, standard reinforcement learning fails due to imbalanced reward scales across datasets. We introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations (one plausible reward-balancing scheme is sketched after this list).
arXiv Detail & Related papers (2025-12-06T22:27:59Z)
- Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset [18.29385508780721]
Med-GLIP is a modality-aware grounding framework trained on Med-GLIP-5M. It implicitly acquires hierarchical semantic understanding from diverse training data. It consistently outperforms state-of-the-art baselines across multiple grounding benchmarks.
arXiv Detail & Related papers (2025-08-14T11:02:38Z)
- MedGemma Technical Report [75.88152277443179]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z)
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
- QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training [29.553607098450698]
QoQ-Med is the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. We show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains. When trained on intensive segmentation data, QoQ-Med can highlight salient regions related to the diagnosis, with an IoU 10x higher than open models.
arXiv Detail & Related papers (2025-05-31T21:02:52Z)
- InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning [19.791150694039466]
We introduce our InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, both of which deliver state-of-the-art performance across seven multimodal medical benchmarks. InfiMed-RL-3B achieves an average accuracy of 59.2%, outperforming even larger models like InternVL3-8B, which achieves 57.3%.
arXiv Detail & Related papers (2025-05-29T10:31:57Z)
- MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis [10.082738539201804]
Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to domain shifts. We introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge achieves a 6-15% improvement in AUC over state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis.
arXiv Detail & Related papers (2025-05-27T19:37:51Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The accompanying instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple levels of perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
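The MedGRPO entry above attributes the failure of standard RL to imbalanced reward scales across source datasets, but its two innovations are not described in this summary. As one plausible illustration of reward balancing, the sketch below standardizes rewards per dataset before computing group-relative advantages in a GRPO-style update; the function name, input schema, and normalization scheme are assumptions for illustration, not MedGRPO's published method.

```python
import statistics
from collections import defaultdict

def balanced_advantages(samples):
    """Group-relative advantages with per-dataset reward standardization.

    `samples` is a list of dicts like
    {"dataset": "vqa_rad", "group": 0, "reward": 0.7},
    where `group` indexes the set of rollouts for one prompt.
    This is an illustrative scheme, not MedGRPO's published method.
    """
    # 1) Standardize rewards within each source dataset so that no
    #    dataset dominates the update purely through its reward scale.
    by_dataset = defaultdict(list)
    for s in samples:
        by_dataset[s["dataset"]].append(s["reward"])
    stats = {
        name: (statistics.mean(rs), statistics.pstdev(rs) or 1.0)
        for name, rs in by_dataset.items()
    }
    for s in samples:
        mu, sigma = stats[s["dataset"]]
        s["norm_reward"] = (s["reward"] - mu) / sigma

    # 2) GRPO-style step: the advantage of each rollout is its normalized
    #    reward minus the mean normalized reward of its prompt group.
    by_group = defaultdict(list)
    for s in samples:
        by_group[s["group"]].append(s["norm_reward"])
    group_mean = {g: statistics.mean(rs) for g, rs in by_group.items()}
    return [s["norm_reward"] - group_mean[s["group"]] for s in samples]
```

Standardizing per dataset before taking the group baseline keeps a dataset whose rewards span, say, [0, 100] from swamping one whose rewards span [0, 1]; whether MedGRPO uses anything like this scheme is not stated in the summary above.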