MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
- URL: http://arxiv.org/abs/2512.06581v1
- Date: Sat, 06 Dec 2025 22:27:59 GMT
- Title: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
- Authors: Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
- Abstract summary: We introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video-, segment-, and frame-level tasks. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning fails due to imbalanced reward scales across datasets. We introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations.
- Score: 47.843626983298726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video-, segment-, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization, which maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge, which evaluates caption quality on five clinical dimensions through comparative similarity scoring. Qwen2.5-VL-7B supervised fine-tuned on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench's efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains. Our project website is available at https://yuhaosu.github.io/MedGRPO/.
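The cross-dataset reward normalization above can be read as a per-dataset remapping that sends each dataset's median reward to one shared target value. The following is a minimal sketch under assumptions of our own: rewards in [0, 1], a tracked per-dataset median, and a piecewise-linear map. The function and the target of 0.5 are illustrative, not the paper's exact formulation.

```python
def normalize_reward(r: float, dataset_median: float, target: float = 0.5) -> float:
    """Map a raw reward in [0, 1] so the dataset's median lands on `target`.

    Hypothetical sketch of cross-dataset reward normalization: rewards below
    the per-dataset median are squeezed into [0, target] and rewards above it
    into [target, 1], so easy and hard datasets exert comparable optimization
    pressure.
    """
    m = dataset_median
    if m <= 0.0 or m >= 1.0:  # degenerate median: leave the reward unscaled
        return r
    if r <= m:
        return target * (r / m)                           # [0, m] -> [0, target]
    return target + (1.0 - target) * (r - m) / (1.0 - m)  # [m, 1] -> [target, 1]

# A hard dataset (median reward 0.2) and an easy one (median 0.8) both map
# their median-quality sample to the same normalized reward of 0.5.
assert normalize_reward(0.2, dataset_median=0.2) == 0.5
assert normalize_reward(0.8, dataset_median=0.8) == 0.5
```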
Related papers
- MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images [25.29568841502814]
We introduce MedMO, a medical foundation model built upon a generalized MLLM architecture. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy.
arXiv Detail & Related papers (2026-02-06T18:59:59Z)
- Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation [25.148217482604746]
We propose VALOR: Visual Alignment of Medical Vision-Language Models for Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). Experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
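GRPO-style post-alignment samples a group of candidate reports per study, scores each with a reward, and standardizes the rewards within the group to obtain advantages. Below is a minimal sketch of that standard group-relative advantage computation; the group size and reward values are made up for illustration.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO advantage: standardize rewards within one sampled group.

    `rewards` holds the scalar reward of each response sampled for the same
    prompt; a response's advantage is its distance from the group mean,
    measured in group standard deviations.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. four candidate reports for one study, scored by a grounding reward
advantages = group_relative_advantages(np.array([0.1, 0.4, 0.4, 0.9]))
```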
arXiv Detail & Related papers (2025-12-18T05:48:21Z)
- MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
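The RA-MoE routing step, as described, dispatches each query to the expert it most resembles. One common way to realize that is cosine similarity between a fused query embedding and one prototype embedding per expert; the sketch below is our own illustration under that assumption, not MedAlign's actual implementation.

```python
import numpy as np

def route_query(query_emb: np.ndarray, expert_prototypes: np.ndarray) -> int:
    """Pick the expert whose prototype is most similar to the query.

    query_emb:         (d,) fused image+text embedding of the query
    expert_prototypes: (n_experts, d) one prototype vector per expert
    Returns the index of the selected expert LVLM.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = expert_prototypes / np.linalg.norm(expert_prototypes, axis=1, keepdims=True)
    return int(np.argmax(p @ q))  # highest cosine similarity wins
```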
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
- MedSeqFT: Sequential Fine-tuning Foundation Models for 3D Medical Image Segmentation [55.37355146924576]
MedSeqFT is a sequential fine-tuning framework for medical image analysis. It adapts pre-trained models to new tasks while refining their representational capacity. It consistently outperforms state-of-the-art fine-tuning strategies.
arXiv Detail & Related papers (2025-09-07T15:22:53Z)
- Kwai Keye-VL Technical Report [80.53170317017147]
We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks.
arXiv Detail & Related papers (2025-07-02T17:57:28Z)
- Efficient Medical VIE via Reinforcement Learning [10.713109515157475]
Visual Information Extraction (VIE) converts unstructured document images into structured formats, which is critical for medical applications like report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct generation. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples.
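RLVR hinges on a reward that can be verified mechanically rather than learned, which suits structured extraction well: a predicted record can be checked field by field against the annotation. The sketch below is an assumed, illustrative reward; the field names and the F1-style scoring are ours, not necessarily the paper's.

```python
def vie_reward(pred: dict, gold: dict) -> float:
    """Verifiable reward for VIE: F1 over exactly matching (key, value) pairs.

    pred/gold map field names (e.g. "patient_name", "test_date") to extracted
    string values; the reward is 1.0 only when the predicted record
    reproduces the annotated record exactly.
    """
    if not pred or not gold:
        return 0.0
    matches = sum(1 for k, v in gold.items() if pred.get(k) == v)
    if matches == 0:
        return 0.0
    precision = matches / len(pred)
    recall = matches / len(gold)
    return 2 * precision * recall / (precision + recall)
```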
arXiv Detail & Related papers (2025-06-16T11:10:25Z)
- QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training [29.553607098450698]
QoQ-Med is the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. We show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains. Trained on intensive segmentation data, QoQ-Med is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than that of open models.
arXiv Detail & Related papers (2025-05-31T21:02:52Z)
- MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis [10.082738539201804]
Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to domain shifts. We introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge achieved a 6-15% improvement in AUC over state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis.
arXiv Detail & Related papers (2025-05-27T19:37:51Z)
- ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models [95.47808515575382]
ExGra-Med is a novel framework for vision-language integration in medical AI. It aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. It matches LLaVA-Med's performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
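In a mean-teacher setup such as PMT's, the teacher that produces pseudo labels is conventionally maintained as an exponential moving average (EMA) of the student's weights, which smooths the student's trajectory and stabilizes the pseudo labels. A minimal PyTorch sketch of that standard update follows; the decay of 0.99 is a common default, not a value from the paper.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.99) -> None:
    """Standard mean-teacher step: teacher <- decay*teacher + (1-decay)*student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```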
arXiv Detail & Related papers (2024-09-08T15:02:25Z)