Related papers: QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

URL: http://arxiv.org/abs/2506.00711v2
Date: Wed, 22 Oct 2025 17:18:21 GMT
Title: QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training
Authors: Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang
Abstract summary: QoQ-Med is the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. We show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains. With QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models.
Score: 29.553607098450698
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
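The abstract describes DRPO as scaling normalized rewards by domain rarity and modality difficulty, in contrast to GRPO. As a rough illustration only, the sketch below combines the standard GRPO-style group normalization with a hypothetical rarity weight (inverse square-root domain frequency). The weighting rule and the function names `domain_weights` and `scaled_advantages` are assumptions made for this example and are not taken from the paper; the modality-difficulty term is omitted.

```python
from collections import Counter
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style step: standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def domain_weights(domain_labels):
    """Hypothetical rarity weights (an assumption, not the paper's rule):
    inverse square-root frequency, rescaled so the weights average to 1 across domains."""
    counts = Counter(domain_labels)
    raw = {d: c ** -0.5 for d, c in counts.items()}
    norm = mean(raw.values())
    return {d: w / norm for d, w in raw.items()}


def scaled_advantages(rewards, domain, weights):
    """Scale the group-normalized advantages by the weight of the sample's domain."""
    return [weights[domain] * a for a in group_normalized_advantages(rewards)]


# Toy training set: 9 chest X-ray samples and 1 ECG sample.
weights = domain_weights(["xray"] * 9 + ["ecg"])
ecg_adv = scaled_advantages([1.0, 0.0, 1.0, 0.0], "ecg", weights)
xray_adv = scaled_advantages([1.0, 0.0, 1.0, 0.0], "xray", weights)
```

In this toy setup the rare ECG domain receives a larger weight than the common X-ray domain, so identical reward patterns produce larger updates for the rarer domain.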

Related papers

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images [25.29568841502814]
We introduce MedMO, a medical foundation model built upon a generalized MLLM architecture. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy.
arXiv Detail & Related papers (2026-02-06T18:59:59Z)
A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z)
MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks but struggle with the complex clinical reasoning required in real-world scenarios. We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z)
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding [47.843626983298726]
We introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning fails due to imbalanced reward scales across datasets. We introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations.
arXiv Detail & Related papers (2025-12-06T22:27:59Z)
MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
UNICON: UNIfied CONtinual Learning for Medical Foundational Models [0.8672882547905405]
In medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Continual learning offers a solution by fine-tuning a model sequentially on different domains or tasks. We propose UNIfied CONtinual Learning for Medical Foundational Models (UNICON), a framework that enables seamless adaptation of foundation models.
arXiv Detail & Related papers (2025-08-19T17:31:32Z)
Towards a general-purpose foundation model for fMRI analysis [58.06455456423138]
We introduce NeuroSTORM, a framework that learns from 4D fMRI volumes and enables efficient knowledge transfer across diverse applications. NeuroSTORM is pre-trained on 28.65 million fMRI frames (>9,000 hours) from over 50,000 subjects across multiple centers and ages 5 to 100. It outperforms existing methods across five tasks: age/gender prediction, phenotype prediction, disease diagnosis, fMRI-to-image retrieval, and task-based fMRI.
arXiv Detail & Related papers (2025-06-11T23:51:01Z)
InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning [19.791150694039466]
We introduce our InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, both of which deliver state-of-the-art performance across seven multimodal medical benchmarks. InfiMed-RL-3B achieves an average accuracy of 59.2%, outperforming even larger models like InternVL3-8B, which achieves 57.3%.
arXiv Detail & Related papers (2025-05-29T10:31:57Z)
MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis [10.082738539201804]
Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to domain shifts. We introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge achieved a 6-15% improvement in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis.
arXiv Detail & Related papers (2025-05-27T19:37:51Z)
Improving Medical Reasoning with Curriculum-Aware Reinforcement Learning [2.262453679768892]
We introduce MedCCO, the first multimodal reinforcement learning framework tailored for medical VQA. MedCCO is fine-tuned on a diverse set of close-ended medical VQA tasks to establish domain-grounded reasoning capabilities. We validate MedCCO across eight challenging medical VQA benchmarks, spanning both close-ended and open-ended settings.
arXiv Detail & Related papers (2025-05-25T16:20:55Z)
CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models [27.726366396356763]
We introduce the Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB). CLIMB is a comprehensive benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. Pretraining on CLIMB effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies.
arXiv Detail & Related papers (2025-03-09T01:45:05Z)
Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [58.25862290294702]
We present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. We also propose MedChain-Agent, an AI system that integrates a feedback mechanism and an MCase-RAG module to learn from previous cases and adapt its responses.
arXiv Detail & Related papers (2024-12-02T15:25:02Z)
Repurposing Foundation Model for Generalizable Medical Time Series Classification [16.21546283978257]
FORMED is a framework for repurposing a backbone foundation model to enable highly generalizable MedTS classification on unseen datasets. We evaluate FORMED on 5 diverse MedTS datasets, benchmarking against 11 Task-Specific Models (TSM) and 4 Task-Specific Adaptation (TSA) methods. Our results demonstrate FORMED's dominant performance, achieving up to 35% absolute improvement in F1-score (on ADFTD dataset) over specialized baselines.
arXiv Detail & Related papers (2024-10-03T23:50:04Z)
Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
Competence-based Multimodal Curriculum Learning for Medical Report Generation [98.10763792453925]
We propose a Competence-based Multimodal Curriculum Learning framework (CMCL) to alleviate the data bias and make best use of available data. Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner. Experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.
arXiv Detail & Related papers (2022-06-24T08:16:01Z)
A multi-stage machine learning model on diagnosis of esophageal manometry [50.591267188664666]
The framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage. This is the first artificial-intelligence-style model to automatically predict CC diagnosis of HRM study from raw multi-swallow data.
arXiv Detail & Related papers (2021-06-25T20:09:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.