Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion
- URL: http://arxiv.org/abs/2504.03135v2
- Date: Thu, 10 Apr 2025 11:52:40 GMT
- Title: Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion
- Authors: Junkai Zhang, Bin Li, Shoujun Zhou, Yue Du
- Abstract summary: Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis. This study proposes HiCA-VQA, a method comprising two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. Experiments on the Rad-Restruct benchmark demonstrate that HiCA-VQA outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions.
- Score: 4.821565717653691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis, so designing Med-VQA systems is important for assisting clinical decision-making and enhancing diagnostic accuracy. Building on this foundation, hierarchical medical VQA extends Med-VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical Med-VQA tasks and established datasets. However, several issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, causing semantic fragmentation across hierarchies; (2) excessive reliance on implicit learning in Transformer-based cross-modal self-attention fusion obscures crucial local semantic correlations in medical scenarios. To address these issues, this study proposes HiCA-VQA, a method comprising two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model to focus on specific image regions according to question type, while the hierarchical decoders make separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module in which images serve as queries and text as key-value pairs. Experiments on the Rad-Restruct benchmark demonstrate that HiCA-VQA outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.
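A minimal PyTorch sketch of the two mechanisms the abstract names: a cross-attention fusion block in which image features act as queries over text key-value pairs, and separate answer decoders per question level. All dimensions, head counts, and answer-vocabulary sizes below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Cross-attention fusion: image features are the queries, text
    (prompt) features the keys/values, so each image region attends
    to the question tokens most relevant to it."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, D) patch features; txt_feats: (B, N_txt, D)
        fused, _ = self.attn(query=img_feats, key=txt_feats, value=txt_feats)
        return self.norm(img_feats + fused)  # residual + layer norm

class HierarchicalAnswerDecoders(nn.Module):
    """One lightweight head per question level, so coarse (e.g. yes/no)
    and fine-grained questions are predicted by separate decoders."""

    def __init__(self, dim: int = 768, answer_sizes=(2, 8, 32)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, n) for n in answer_sizes)

    def forward(self, fused, level: int):
        pooled = fused.mean(dim=1)        # (B, D) pooled fused representation
        return self.heads[level](pooled)  # logits for this question level

# Fuse image and prompt features, then decode at the question's level.
fusion, decoders = CrossAttentionFusion(), HierarchicalAnswerDecoders()
img = torch.randn(4, 196, 768)  # e.g. ViT patch features
txt = torch.randn(4, 32, 768)   # encoded hierarchical prompt
logits = decoders(fusion(img, txt), level=2)
```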
Related papers
- MedCoT: Medical Chain of Thought via Hierarchical Expert [48.91966620985221]
This paper presents MedCoT, a novel hierarchical expert verification reasoning chain method. It is designed to enhance interpretability and accuracy in biomedical imaging inquiries. Experimental evaluations on four standard Med-VQA datasets demonstrate that MedCoT surpasses existing state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-18T11:14:02Z)
- GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis [44.76975131560712]
We introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX). With 151,025 images and 1,605,575 questions, GEMeX is currently the largest chest X-ray VQA dataset.
arXiv Detail & Related papers (2024-11-25T07:36:46Z)
- Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering [45.058569118999436]
Given a pair of main and reference images, this task attempts to answer several questions about the diseases in both images.
We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images.
arXiv Detail & Related papers (2023-07-22T05:34:18Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model (a sketch of this alignment step follows this entry).
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
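The vision-language alignment described in the PMC-VQA entry above is commonly realized as a learned projection from a frozen vision encoder's features into the language model's embedding space. A minimal sketch under that assumption; the module name and all dimensions are hypothetical, not taken from PMC-VQA.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Projects frozen vision-encoder patch features into a language
    model's word-embedding space, so image tokens can be prepended to
    the text prompt. Dimensions are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):
        # patch_feats: (B, N_patches, vision_dim) from a frozen vision encoder
        return self.proj(patch_feats)  # (B, N_patches, llm_dim) pseudo word embeddings

# The projected tokens are concatenated with question embeddings and fed
# to the language model, which is trained to generate the answer text.
projector = VisionToLLMProjector()
img_tokens = projector(torch.randn(2, 256, 1024))
```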
- Medical visual question answering using joint self-supervised learning [8.817054025763325]
The encoder embeds the image and text modalities jointly with a self-attention mechanism.
The decoder is connected to the top of the encoder and fine-tuned on a small medical VQA dataset (a sketch of such a joint encoder follows this entry).
arXiv Detail & Related papers (2023-02-25T12:12:22Z)
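A joint image-text encoder of the kind the entry above describes typically runs shared self-attention over the concatenation of image and text tokens. A minimal sketch under that assumption; layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class JointModalityEncoder(nn.Module):
    """Self-attention over concatenated image and text tokens, so each
    token can attend across both modalities. Sizes are assumptions."""

    def __init__(self, dim: int = 512, num_heads: int = 8, depth: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img_tokens, txt_tokens):
        joint = torch.cat([img_tokens, txt_tokens], dim=1)  # (B, N_img+N_txt, D)
        return self.encoder(joint)

# A lightweight answer decoder (e.g. a classifier head) is then fine-tuned
# on a small medical VQA dataset on top of the pretrained encoder output.
encoder = JointModalityEncoder()
fused = encoder(torch.randn(2, 196, 512), torch.randn(2, 20, 512))
```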
- Q-Net: Query-Informed Few-Shot Medical Image Segmentation [5.615188751640673]
We propose a Query-informed Meta-FSS approach, which mimics the learning mechanism of an expert clinician.
We build Q-Net based on ADNet, a recently proposed anomaly detection-inspired method.
Q-Net achieves state-of-the-art performance on two widely used datasets.
arXiv Detail & Related papers (2022-08-24T11:36:53Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective to four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
- MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images guided by the text description.
Experiments on two datasets show that MuVAM surpasses the state-of-the-art method.
arXiv Detail & Related papers (2021-07-07T13:40:25Z)
- Inheritance-guided Hierarchical Assignment for Clinical Automatic Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making.
We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv Detail & Related papers (2021-01-27T13:16:51Z)
- Hierarchical Deep Multi-modal Network for Medical Visual Question Answering [25.633660028022195]
We propose a hierarchical deep multi-modal network that analyzes and classifies end-user questions/queries.
We integrate the QS model into the hierarchical deep multi-modal neural network to generate proper answers to queries related to medical images.
arXiv Detail & Related papers (2020-09-27T07:24:41Z)
- Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering [89.76059961309453]
The HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization exam.
These questions are the most challenging for current QA systems.
We present MurKe, a Multi-step reasoning with Knowledge extraction framework that strives to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)