MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data
- URL: http://arxiv.org/abs/2601.06847v1
- Date: Sun, 11 Jan 2026 10:34:18 GMT
- Title: MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data
- Authors: Mengmeng Zhang, Xiaoping Wu, Hao Luo, Fan Wang, Yisheng Lv,
- Abstract summary: We introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. We also present MedGround-35K, a novel multimodal medical dataset.
- Score: 32.65971100171597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently achieve improved referring grounding performance, enhance multi-object semantic disambiguation, and exhibit strong generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchor medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.
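The abstract describes using expert segmentation masks as spatial anchors from which localization targets and shape/spatial cues are derived. Below is a minimal illustrative sketch of that idea: converting a binary mask into a bounding box plus coarse region and shape descriptors that could seed a grounded query. All function names, thresholds, and cue definitions here are assumptions for illustration, not the authors' released pipeline.

```python
"""Sketch: derive a localization target and coarse spatial cues from a mask.

Illustrative only; the real MedGround pipeline and its exact cues are not
public, so every name and threshold below is an assumption.
"""
import numpy as np


def mask_to_grounding_target(mask: np.ndarray) -> dict:
    """Turn a 2-D binary mask into a bounding box and simple shape/spatial cues."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("empty mask: nothing to ground")

    # Tight bounding box (x_min, y_min, x_max, y_max) as the localization target.
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # Coarse spatial cue: which image region the structure's centroid falls in.
    h, w = mask.shape
    cy, cx = ys.mean() / h, xs.mean() / w
    vertical = "upper" if cy < 1 / 3 else "lower" if cy > 2 / 3 else "middle"
    horizontal = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "central"

    # Simple shape cues: fraction of the image covered and elongation of the box.
    area_frac = ys.size / (h * w)
    box_w, box_h = box[2] - box[0] + 1, box[3] - box[1] + 1
    elongation = max(box_w, box_h) / min(box_w, box_h)

    return {
        "box_xyxy": box,
        "region": f"{vertical} {horizontal}",
        "area_fraction": round(float(area_frac), 4),
        "elongation": round(float(elongation), 2),
    }


if __name__ == "__main__":
    demo = np.zeros((256, 256), dtype=np.uint8)
    demo[40:90, 160:230] = 1  # a small lesion-like blob toward the upper right
    print(mask_to_grounding_target(demo))
```

In a pipeline like the one the abstract outlines, cues of this kind would be passed to a VLM prompt to synthesize a natural referring query, and the multi-stage verification (format checks, geometry and medical-prior rules, image-based judging) would filter the resulting pairs.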
Related papers
- RAU: Reference-based Anatomical Understanding with Vision Language Models [26.06602931463068]
We introduce RAU, a framework for reference-based anatomical understanding with vision-language models (VLMs). We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2.
arXiv Detail & Related papers (2025-09-26T14:32:03Z) - Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset [18.29385508780721]
Med-GLIP is a modality-aware grounding framework trained on Med-GLIP-5M. It implicitly acquires hierarchical semantic understanding from diverse training data. It consistently outperforms state-of-the-art baselines across multiple grounding benchmarks.
arXiv Detail & Related papers (2025-08-14T11:02:38Z) - Describe Anything in Medical Images [32.785523415007]
We propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images. MedDAM employs medical expert-designed prompts tailored to specific imaging modalities and establishes a robust evaluation benchmark. This benchmark evaluates both MedDAM and other large vision-language models, focusing on clinical factuality through attribute-level verification tasks.
arXiv Detail & Related papers (2025-05-09T05:45:31Z) - MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives [11.242775987217032]
MedicalNarratives is a dataset curated from medical pedagogical videos similar in nature to data collected in Think-Aloud studies. Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples containing dense annotations in the form of traces and bounding boxes. To evaluate the utility of MedicalNarratives, we train GenMedClip based on the CLIP architecture using our dataset spanning 12 medical domains.
arXiv Detail & Related papers (2025-01-07T23:32:05Z) - A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models [95.47808515575382]
ExGra-Med is a novel framework for vision-language integration in medical AI. It aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. It matches LLaVA-Med's performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z) - C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation framework with a domain transfer network (C^2M-DoT).
C^2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z)