MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization
- URL: http://arxiv.org/abs/2507.02994v1
- Date: Tue, 01 Jul 2025 21:51:42 GMT
- Title: MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization
- Authors: Huihui Xu, Yuanpeng Nie, Hualiang Wang, Ying Chen, Wei Li, Junzhi Ning, Lihao Liu, Hongqiu Wang, Lei Zhu, Jiyao Liu, Xiaomeng Li, Junjun He
- Abstract summary: Medical Image Grounding (MIG) involves localizing specific regions in medical images based on textual descriptions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations.
- Score: 19.70803794316208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce the spatial relationships among them. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine a spatial accuracy reward and a semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose the Chain-of-Box template, which integrates the visual information of referring bounding boxes into the <think> reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets, MS-CXR, ChestX-ray8, and M3D-RefSeg, demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component of our approach. Code, checkpoints, and datasets are available at https://github.com/bio-mlhui/MedGround-R1
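A minimal sketch of how the Spatial-Semantic Reward and the group-relative advantage described above might look, assuming an IoU-based spatial accuracy term, a cosine-similarity semantic consistency term over text embeddings, and an equal weighting `alpha`; the function names, the `embed` model, and the weighting are illustrative assumptions, not the authors' released implementation:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def spatial_semantic_reward(pred_box, gt_box, pred_text, gt_text,
                            embed, alpha=0.5):
    """Blend a spatial accuracy term (IoU) with a semantic consistency
    term (cosine similarity of text embeddings), so that completions
    with poor localization still receive graded, non-zero feedback."""
    r_spatial = iou(pred_box, gt_box)
    e_p, e_g = embed(pred_text), embed(gt_text)
    r_semantic = float(np.dot(e_p, e_g) /
                       (np.linalg.norm(e_p) * np.linalg.norm(e_g) + 1e-8))
    return alpha * r_spatial + (1.0 - alpha) * r_semantic

def group_relative_advantages(rewards):
    """GRPO's group-relative baseline: standardize each sampled
    completion's reward against the mean/std of its own group,
    replacing a learned value function."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```

For each prompt, a group of completions is sampled, scored with a reward of this form, and standardized within the group; the resulting advantages weight the policy-gradient update, which is how GRPO dispenses with both a value network and CoT annotations.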
Related papers
- VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z)
- Large Language Model with Region-guided Referring and Grounding for CT Report Generation [4.804660464589285]
Existing methods primarily consider only the global features of the entire volume. We propose Reg2RG, the first region-guided referring and grounding framework for CT report generation.
arXiv Detail & Related papers (2024-11-23T12:25:06Z)
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective via forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models [12.264115733611058]
The task of performing localization with textual guidance is commonly referred to as phrase grounding. We use a publicly available foundation model, namely the Latent Diffusion Model, to perform this challenging task. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology (see the sketch after this entry).
arXiv Detail & Related papers (2024-04-19T14:43:48Z)
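The abstract leaves the mechanism implicit, but a common way to repurpose an off-the-shelf latent diffusion model for phrase grounding is to aggregate the UNet's text-to-image cross-attention maps for the phrase tokens and read a box off the resulting saliency map. A minimal sketch of that aggregation step, assuming the per-layer/per-timestep attention maps have already been hooked out of the model (all names are illustrative assumptions):

```python
import numpy as np

def ground_from_cross_attention(attn_maps, threshold=0.5):
    """attn_maps: array of shape (num_maps, H, W), one cross-attention
    heatmap per (layer, timestep) for the query phrase's tokens.
    Returns a bounding box (x1, y1, x2, y2) around the salient region,
    or None if nothing exceeds the threshold."""
    heat = attn_maps.mean(axis=0)                # average over layers/timesteps
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    ys, xs = np.nonzero(heat >= threshold)       # binarize the saliency map
    if len(xs) == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())
```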
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts (a sketch of the contrastive idea follows this entry).
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z)
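The abstract does not spell out the mechanism, but training-free region guidance of this kind is often implemented in the spirit of classifier-free guidance: contrast the model's next-token logits on the full image against logits on a copy with the highlighted region blacked out. A minimal sketch under that assumption (names and the exact guidance form are illustrative, not the paper's precise formulation):

```python
import numpy as np

def contrastive_region_guidance(logits_full, logits_masked, alpha=1.0):
    """Amplify evidence that depends on the highlighted region by
    contrasting logits computed with the region visible against logits
    computed with the region blacked out. Applied at each decoding
    step, before sampling/argmax."""
    return logits_full + alpha * (logits_full - logits_masked)

# Hypothetical usage at one decoding step:
logits_full = np.array([2.0, 0.5, -1.0])    # model sees the full image
logits_masked = np.array([1.5, 0.6, -0.8])  # region blacked out
guided = contrastive_region_guidance(logits_full, logits_masked)
```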
- Bidirectional Domain Mixup for Domain Adaptive Semantic Segmentation [73.3083304858763]
This paper systematically studies the impact of mixup under the domain adaptive semantic segmentation task.
Specifically, we achieve domain mixup in two steps, cut and paste (see the sketch after this entry).
We provide extensive ablation experiments to empirically verify our main components of the framework.
arXiv Detail & Related papers (2023-03-17T05:22:44Z)
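A minimal sketch of one cut-and-paste direction under generic assumptions (the paper's exact cut policy, region shapes, and label handling may differ); running it once source-to-target and once target-to-source gives the bidirectional mixup:

```python
import numpy as np

def cut_and_paste_mixup(img_src, lbl_src, img_tgt, lbl_tgt, rng, ratio=0.5):
    """Cut a random rectangle from the source image/label pair and
    paste it onto copies of the target pair. Calling this again with
    the arguments swapped gives the other mixup direction."""
    h, w = img_src.shape[:2]
    ch, cw = int(h * ratio), int(w * ratio)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    img_mix, lbl_mix = img_tgt.copy(), lbl_tgt.copy()
    img_mix[y:y+ch, x:x+cw] = img_src[y:y+ch, x:x+cw]
    lbl_mix[y:y+ch, x:x+cw] = lbl_src[y:y+ch, x:x+cw]
    return img_mix, lbl_mix
```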
- Keep Your Friends Close & Enemies Farther: Debiasing Contrastive Learning with Spatial Priors in 3D Radiology Images [11.251405818285331]
We propose a 3D contrastive framework (Spade) that leverages extracted correspondences to select more effective positive & negative samples for representation learning.
Compared to recent state-of-the-art approaches, Spade shows notable improvements on three downstream segmentation tasks.
arXiv Detail & Related papers (2022-11-16T03:36:06Z)
- CPRAL: Collaborative Panoptic-Regional Active Learning for Semantic Segmentation [35.11139361684248]
We propose a Collaborative Panoptic-Regional Active Learning framework (CPRAL) to address the semantic segmentation task.
Considering the class imbalance in segmentation datasets, we introduce a Regional Gaussian Attention (RGA) module to achieve semantics-biased selection.
We show that CPRAL outperforms cutting-edge methods while requiring a smaller labeling proportion.
arXiv Detail & Related papers (2021-12-11T13:13:13Z)
- Global Aggregation then Local Distribution for Scene Parsing [99.1095068574454]
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks.
Our approach allows us to set a new state of the art on major semantic segmentation benchmarks, including Cityscapes, ADE20K, Pascal Context, CamVid, and COCO-Stuff.
arXiv Detail & Related papers (2021-07-28T03:46:57Z)
- PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a Prior-Guided Local (PGL) self-supervised model that learns region-wise local consistency in the latent feature space (a minimal sketch follows this entry).
Our PGL model learns distinctive representations of local regions and is hence able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
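A minimal sketch of a region-wise local consistency objective of the kind described, assuming corresponding regions between two augmented views are recoverable from the known spatial transform; the pooling, similarity measure, and names are illustrative assumptions:

```python
import numpy as np

def local_consistency_loss(feat_a, feat_b, regions):
    """For each pair of corresponding regions (recovered from the known
    spatial augmentation), pull the average-pooled local features of the
    two views together via cosine distance.
    feat_a, feat_b: (C, H, W) feature maps of the two augmented views.
    regions: list of ((ya1, xa1, ya2, xa2), (yb1, xb1, yb2, xb2)) pairs."""
    loss = 0.0
    for (ya1, xa1, ya2, xa2), (yb1, xb1, yb2, xb2) in regions:
        va = feat_a[:, ya1:ya2, xa1:xa2].mean(axis=(1, 2))
        vb = feat_b[:, yb1:yb2, xb1:xb2].mean(axis=(1, 2))
        va = va / (np.linalg.norm(va) + 1e-8)
        vb = vb / (np.linalg.norm(vb) + 1e-8)
        loss += 1.0 - float(va @ vb)          # cosine distance per region
    return loss / max(len(regions), 1)
```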
- High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state of the art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.