Related papers: GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning

GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning

URL: http://arxiv.org/abs/2506.17939v1
Date: Sun, 22 Jun 2025 08:09:58 GMT
Title: GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning
Authors: Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu,
Abstract summary: Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images.<n>Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers.<n>This work first proposes a Thinking with Visual Grounding dataset wherein the answer generation is decomposed into intermediate reasoning steps.<n>We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
Score: 50.94508930739623
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address this, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset wherein the answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://huggingface.co/datasets/BoKelvin/GEMeX-ThinkVG.

Related papers

Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation [20.59298361626719]
We propose a chain-of-medical-thought approach (CoMT) to mitigate hallucinations in medical report generation.<n>CoMT intends to imitate the cognitive process of human doctors by decomposing diagnostic procedures.
arXiv Detail & Related papers (2024-06-17T12:03:32Z)
MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale [19.94415334436024]
We devise a semi-automated annotation process to streamline data preparation and build new benchmark MedVQA datasets. These datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations. We also design a novel framework, MedThink, which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales.
arXiv Detail & Related papers (2024-04-18T17:53:19Z)
Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG)<n>MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner.<n>We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts. Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches [21.16331827504689]
A growing number of individuals now prefer instructional videos as they offer a series of step-by-step procedures to accomplish particular tasks. The instructional videos from the medical domain may provide the best possible visual answers to first aid, medical emergency, and medical education questions. The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications that can help the public with their health-related questions.
arXiv Detail & Related papers (2023-09-21T16:21:28Z)
Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models [42.360431316298204]
We focus on open-ended VQA and motivated by the recent advances in language models consider it as a generative task. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. We evaluate our approach on the prime medical VQA benchmarks, namely, Slake, OVQA and PathVQA.
arXiv Detail & Related papers (2023-03-10T15:17:22Z)
Consistency-preserving Visual Question Answering in Medical Imaging [2.005299372367689]
Visual Question Answering (VQA) models take an image and a natural-language question as input and infer the answer to the question. We propose a novel loss function and corresponding training procedure that allows the inclusion of relations between questions into the training process. Our experiments show that our method outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2022-06-27T13:38:50Z)
Explaining Clinical Decision Support Systems in Medical Imaging using Cycle-Consistent Activation Maximization [112.2628296775395]
Clinical decision support using deep neural networks has become a topic of steadily growing interest. clinicians are often hesitant to adopt the technology because its underlying decision-making process is considered to be intransparent and difficult to comprehend. We propose a novel decision explanation scheme based on CycleGAN activation which generates high-quality visualizations of classifier decisions even in smaller data sets.
arXiv Detail & Related papers (2020-10-09T14:39:27Z)
A Question-Centric Model for Visual Question Answering in Medical Imaging [3.619444603816032]
We present a novel Visual Question Answering approach that allows an image to be queried by means of a written question. Experiments on a variety of medical and natural image datasets show that by fusing image and question features in a novel way, the proposed approach achieves an equal or higher accuracy compared to current methods.
arXiv Detail & Related papers (2020-03-02T10:16:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.