Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering
- URL: http://arxiv.org/abs/2410.21000v3
- Date: Sun, 11 May 2025 14:45:34 GMT
- Title: Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering
- Authors: Zhilin Zhang, Jie Wang, Zhanghao Qin, Ruiqi Zhu, Xiaoliang Gong
- Abstract summary: Medical Visual Question Answering (MedVQA) has attracted growing interest at the intersection of medical image understanding and natural language processing for clinical applications. We introduce a fusion model, OMniBAN, that integrates Orthogonality loss, Multi-head attention, and a Bilinear Attention Network to achieve high computational efficiency as well as solid performance.
- Score: 3.7133600776119136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (MedVQA) has attracted growing interest at the intersection of medical image understanding and natural language processing for clinical applications. By interpreting medical images and providing precise answers to relevant clinical inquiries, MedVQA has the potential to support diagnostic decision-making and reduce workload across various fields like radiology. While recent approaches rely heavily on unified large pre-trained Visual-Language Models, research on more efficient fusion mechanisms remains relatively limited in this domain. In this paper, we introduce a fusion model, OMniBAN, that integrates Orthogonality loss, Multi-head attention, and a Bilinear Attention Network to achieve high computational efficiency as well as solid performance. We conduct comprehensive experiments and demonstrate how bilinear attention fusion can approximate the performance of larger fusion models like cross-modal Transformer. Our results show that OMniBAN requires fewer parameters (approximately 2/3 of Transformer-based Co-Attention) and substantially lower FLOPs (approximately 1/4), while achieving comparable overall performance and even slight improvements on closed-ended questions on two key MedVQA benchmarks. This balance between efficiency and accuracy suggests that OMniBAN could be a viable option for real-world medical image question answering, where computational resources are often constrained.
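To illustrate the kind of fusion the abstract describes, the following PyTorch sketch pairs a low-rank bilinear attention module with a simple orthogonality loss. All class names, dimensions, and the exact loss form are assumptions made for exposition; this is not the authors' OMniBAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttentionFusion(nn.Module):
    """Low-rank bilinear attention between image regions and question tokens (sketch)."""

    def __init__(self, dim_v: int, dim_q: int, dim_h: int, glimpses: int = 2):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_h)   # project image region features
        self.proj_q = nn.Linear(dim_q, dim_h)   # project question token features
        self.att = nn.Linear(dim_h, glimpses)   # one attention map per glimpse

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (B, N, dim_v) image region features; q: (B, M, dim_q) question token features
        hv = self.proj_v(v)                                   # (B, N, dim_h)
        hq = self.proj_q(q)                                   # (B, M, dim_h)
        joint = hv.unsqueeze(2) * hq.unsqueeze(1)             # bilinear interaction (B, N, M, dim_h)
        logits = self.att(joint)                              # (B, N, M, glimpses)
        att = F.softmax(logits.flatten(1, 2), dim=1).view_as(logits)
        # sum joint features over all region-token pairs, weighted per glimpse
        fused = torch.einsum("bnmg,bnmh->bgh", att, joint)    # (B, glimpses, dim_h)
        return fused.flatten(1)                               # (B, glimpses * dim_h)

def orthogonality_loss(v_feat: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
    """Assumed form: penalize squared cosine similarity between paired modality features."""
    v_n = F.normalize(v_feat, dim=-1)
    q_n = F.normalize(q_feat, dim=-1)
    return (v_n * q_n).sum(dim=-1).pow(2).mean()

# Example usage with assumed feature sizes:
fusion = BilinearAttentionFusion(dim_v=1024, dim_q=768, dim_h=512)
fused = fusion(torch.randn(2, 36, 1024), torch.randn(2, 20, 768))  # -> (2, 1024)
```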
Related papers
- NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding [51.63264715941068]
NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization) is a novel cross-modality interaction VLM-based framework.
arXiv Detail & Related papers (2025-08-06T05:44:01Z) - ClinicalFMamba: Advancing Clinical Assessment using Mamba-based Multimodal Neuroimaging Fusion [7.0879234284391455]
Multimodal medical image fusion integrates complementary information from different imaging modalities to enhance diagnostic accuracy and treatment planning.
CNNs excel at local feature extraction but struggle to model global context effectively.
Transformers achieve superior long-range modeling at the cost of quadratic computational complexity.
Recent State Space Models (SSMs) offer a promising alternative.
We propose ClinicalFMamba, a novel end-to-end CNN-Mamba hybrid architecture.
arXiv Detail & Related papers (2025-08-05T02:25:53Z) - GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning [50.94508930739623]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images.
Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers.
This work first proposes a Thinking with Visual Grounding dataset wherein the answer generation is decomposed into intermediate reasoning steps.
We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
arXiv Detail & Related papers (2025-06-22T08:09:58Z) - EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion [0.0]
EVM-Fusion is an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for medical image classification.
Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion's strong classification performance, achieving 99.75% test accuracy.
arXiv Detail & Related papers (2025-05-23T00:41:57Z) - Multi-Granularity Vision Fastformer with Fusion Mechanism for Skin Lesion Segmentation [7.944123371140182]
This research aims to optimize the balance between computational costs and long-range dependency modelling.
We propose a lightweight U-shape network that utilizes Vision Fastformer with Fusion Mechanism (VFFM-UNet).
arXiv Detail & Related papers (2025-04-04T01:27:43Z) - Multi-Omics Fusion with Soft Labeling for Enhanced Prediction of Distant Metastasis in Nasopharyngeal Carcinoma Patients after Radiotherapy [4.971538849792411]
One of the challenges encountered in the integration of omics data is the presence of unpredictability.
This study aims to develop a fusion methodology that mitigates the disparities inherent in omics data.
arXiv Detail & Related papers (2025-02-12T05:26:59Z) - ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction [24.328576712419814]
We propose an Integrated Cross-modal Fusion Network (ICFNet) that integrates histopathology whole slide images, genomic expression profiles, patient demographics, and treatment protocols.
ICFNet outperforms state-of-the-art algorithms on five public TCGA datasets.
arXiv Detail & Related papers (2025-01-06T05:49:08Z) - Edge-Enhanced Dilated Residual Attention Network for Multimodal Medical Image Fusion [13.029564509505676]
Multimodal medical image fusion is a crucial task that combines complementary information from different imaging modalities into a unified representation.
While deep learning methods have significantly advanced fusion performance, some of the existing CNN-based methods fall short in capturing fine-grained multiscale and edge features.
We propose a novel CNN-based architecture that addresses these limitations by introducing a Dilated Residual Attention Network Module for effective multiscale feature extraction.
arXiv Detail & Related papers (2024-11-18T18:11:53Z) - Random Token Fusion for Multi-View Medical Diagnosis [2.3458652461211935]
In multi-view medical datasets, deep learning models often fuse information from different imaging perspectives to improve diagnosis performance.
Existing approaches are prone to overfitting and rely heavily on view-specific features, which can lead to trivial solutions.
In this work, we introduce a novel technique designed to enhance image analysis using multi-view medical transformers.
arXiv Detail & Related papers (2024-10-21T10:19:45Z) - Analyzing the Effect of $k$-Space Features in MRI Classification Models [0.0]
We have developed an explainable AI methodology tailored for medical imaging.
We employ a Convolutional Neural Network (CNN) that analyzes MRI scans across both image and frequency domains.
This approach not only enhances early training efficiency but also deepens our understanding of how additional features impact the model predictions.
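A minimal sketch of how image-domain and frequency-domain (k-space) inputs might be combined in a single CNN, in the spirit of the summary above; the dual-channel design and all names are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualDomainCNN(nn.Module):
    """Hypothetical classifier that stacks image and k-space channels (assumed design)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # two input channels: image magnitude and log-magnitude of its 2D FFT
        self.backbone = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 1, H, W) magnitude image
        freq = torch.log1p(torch.fft.fft2(img).abs())   # frequency-domain feature map
        x = torch.cat([img, freq], dim=1)               # stack the two domains as channels
        feats = self.backbone(x).flatten(1)
        return self.head(feats)
```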
arXiv Detail & Related papers (2024-09-20T15:43:26Z) - STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z) - HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution [6.7341750484636975]
Transformer-based networks can only use input information from a limited spatial range.
A novel Hybrid Multi-Axis Aggregation network (HMA) is proposed in this paper to better exploit the potential information in features.
The experimental results show that HMA outperforms the state-of-the-art methods on the benchmark dataset.
arXiv Detail & Related papers (2024-05-08T12:14:34Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore masked autoencoders (MAEs) for cross-domain zero-shot learning, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
arXiv Detail & Related papers (2024-03-07T16:11:43Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor (the player) and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Free Form Medical Visual Question Answering in Radiology [3.495246564946556]
Research in medical Visual Question Answering has been scant, only gaining momentum since 2018.
Our research delves into the effective representation of radiology images and the joint learning of multimodal representations.
Our model achieves a top-1 accuracy of 79.55% with a less complex architecture, demonstrating comparable performance to current state-of-the-art models.
arXiv Detail & Related papers (2024-01-23T20:26:52Z) - XAI for In-hospital Mortality Prediction via Multimodal ICU Data [57.73357047856416]
We propose an efficient, explainable AI solution for predicting in-hospital mortality via multimodal ICU data.
We employ multimodal learning in our framework, which can receive heterogeneous inputs from clinical data and make decisions.
Our framework can be easily transferred to other clinical tasks, which facilitates the discovery of crucial factors in healthcare research.
arXiv Detail & Related papers (2023-12-29T14:28:04Z) - Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
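A hypothetical sketch of the kind of pipeline described above: a CNN extracts CXR features and a transformer encoder-decoder generates report tokens conditioned on the visual features plus a demographic embedding. Model names, dimensions, and the demographic encoding are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ConditionedReportGenerator(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_demo: int = 4):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])    # CXR feature map (B, 512, h, w)
        self.vis_proj = nn.Linear(512, d_model)
        self.demo_proj = nn.Linear(n_demo, d_model)              # demographics (e.g. age, sex) as a vector
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True,
                                          num_encoder_layers=2, num_decoder_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, cxr, demo, report_tokens):
        # cxr: (B, 3, H, W); demo: (B, n_demo); report_tokens: (B, T) token ids
        fmap = self.cnn(cxr)                                     # (B, 512, h, w)
        vis = self.vis_proj(fmap.flatten(2).transpose(1, 2))     # (B, h*w, d_model)
        demo_tok = self.demo_proj(demo).unsqueeze(1)             # (B, 1, d_model)
        memory_in = torch.cat([vis, demo_tok], dim=1)            # visual + demographic tokens
        tgt = self.tok_emb(report_tokens)                        # (B, T, d_model)
        hidden = self.transformer(memory_in, tgt)                # encoder-decoder over both
        return self.out(hidden)                                  # (B, T, vocab_size) logits
```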
arXiv Detail & Related papers (2023-11-18T14:52:26Z) - C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation method with a domain transfer network (C2M-DoT).
C2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z) - MedAugment: Universal Automatic Data Augmentation Plug-in for Medical Image Analysis [9.724228319915609]
Data augmentation (DA) has been widely leveraged in computer vision to alleviate the data shortage.
DA in medical image analysis (MIA) faces multiple challenges.
We propose an efficient and effective automatic DA method termed MedAugment.
arXiv Detail & Related papers (2023-06-30T08:22:48Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z) - MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.
We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z) - Fusion of medical imaging and electronic health records with attention and multi-head mechanisms [4.433829714749366]
We propose a multi-modal attention module which uses EHR data to guide the selection of important regions during image feature extraction.
We also propose to incorporate a multi-head mechanism into the gated multimodal unit (GMU) so that it can fuse image and EHR features in parallel across different subspaces, as sketched below.
Experiments on predicting the Glasgow Outcome Scale (GOS) of intracerebral hemorrhage patients and classifying Alzheimer's Disease showed that the proposed method can automatically focus on task-related areas.
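A hypothetical sketch of a gated multimodal unit with per-subspace (multi-head) gates over image and EHR features, to make the described fusion concrete; names and dimensions are assumed, not taken from the paper.

```python
import torch
import torch.nn as nn

class MultiHeadGMU(nn.Module):
    """Assumed multi-head variant of a gated multimodal unit (GMU) for image + EHR fusion."""

    def __init__(self, dim_img: int, dim_ehr: int, dim_h: int, heads: int = 4):
        super().__init__()
        assert dim_h % heads == 0
        self.heads, self.dim_head = heads, dim_h // heads
        self.proj_img = nn.Linear(dim_img, dim_h)
        self.proj_ehr = nn.Linear(dim_ehr, dim_h)
        self.gate = nn.Linear(dim_img + dim_ehr, heads)   # one gate per subspace

    def forward(self, img: torch.Tensor, ehr: torch.Tensor) -> torch.Tensor:
        # img: (B, dim_img) image features; ehr: (B, dim_ehr) EHR features
        h_img = torch.tanh(self.proj_img(img)).view(-1, self.heads, self.dim_head)
        h_ehr = torch.tanh(self.proj_ehr(ehr)).view(-1, self.heads, self.dim_head)
        z = torch.sigmoid(self.gate(torch.cat([img, ehr], dim=-1)))  # (B, heads)
        z = z.unsqueeze(-1)                                          # (B, heads, 1)
        fused = z * h_img + (1.0 - z) * h_ehr                        # gated mix per subspace
        return fused.flatten(1)                                      # (B, dim_h)
```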
arXiv Detail & Related papers (2021-12-22T07:39:26Z) - Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis [88.39466012709205]
This paper builds a novel medical slice synthesis method to increase the between-slice resolution.
Considering that the ground-truth intermediate medical slices are always absent in clinical practice, we introduce the incremental cross-view mutual distillation strategy.
Our method outperforms state-of-the-art algorithms by clear margins.
arXiv Detail & Related papers (2021-12-20T03:38:37Z) - Context-Aware Refinement Network Incorporating Structural Connectivity Prior for Brain Midline Delineation [50.868845400939314]
We propose a context-aware refinement network (CAR-Net) to refine and integrate the feature pyramid representation generated by the UNet.
For keeping the structural connectivity of the brain midline, we introduce a novel connectivity regular loss.
The proposed method requires fewer parameters and outperforms three state-of-the-art methods in terms of four evaluation metrics.
arXiv Detail & Related papers (2020-07-10T14:01:20Z)