BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA
- URL: http://arxiv.org/abs/2503.02476v1
- Date: Tue, 04 Mar 2025 10:39:42 GMT
- Title: BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA
- Authors: Zhengyang Ji, Shang Gao, Li Liu, Yifan Jia, Yutao Yue
- Abstract summary: We propose BioD2C: a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA. BioD2C achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. In this work, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually-altered images and aligning question-answer pairs with multimodal context.
- Score: 5.840467499436581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Biomedical visual question answering (VQA) has been widely studied and has demonstrated significant application value and potential in fields such as assistive medical diagnosis. Despite their success, current biomedical VQA models perform multimodal information interaction only at the model level within large language models (LLMs), leading to suboptimal multimodal semantic alignment when dealing with complex tasks. To address this issue, we propose BioD2C: a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA, which achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. Specifically, we first integrate textual features into visual features via an image-text fusion mechanism as feature-level semantic interaction, obtaining visual features conditioned on the given text, and then introduce a text-queue-based cross-modal soft semantic loss function to further align the image semantics with the question semantics. In addition, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually-altered images and aligning question-answer pairs with multimodal context, and we train our model on this dataset. Extensive experimental results demonstrate that BioD2C achieves state-of-the-art (SOTA) performance across multiple downstream datasets, showcasing its robustness, generalizability, and potential to advance biomedical VQA research.
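The abstract describes two components: a feature-level image-text fusion mechanism that conditions visual features on the question, and a text-queue-based cross-modal soft semantic loss that pulls image semantics toward question semantics. The following is a minimal, illustrative PyTorch sketch of how such components could be wired together; the module names, the choice of cross-attention for fusion, the queue handling, and the KL-based soft-target formulation are assumptions made for illustration, since the paper's exact architecture and loss are not given here.

```python
# Illustrative sketch only; dimensions, modules, and the loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageTextFusion(nn.Module):
    """Feature-level fusion: visual tokens attend to question tokens (assumed design)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Query with visual tokens, attend over the question to get text-conditioned visual features.
        fused, _ = self.cross_attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        return self.norm(visual_tokens + fused)


def soft_semantic_loss(image_emb: torch.Tensor,
                       question_emb: torch.Tensor,
                       text_queue: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Text-queue-based cross-modal soft semantic loss (hypothetical formulation).

    Instead of a hard one-hot contrastive target, the target distribution over the
    queued text embeddings is derived from question-to-queue similarities, and the
    image-to-queue distribution is pulled toward it with a KL divergence.
    """
    image_emb = F.normalize(image_emb, dim=-1)        # (B, D)
    question_emb = F.normalize(question_emb, dim=-1)  # (B, D)
    text_queue = F.normalize(text_queue, dim=-1)      # (Q, D) queue of text embeddings

    img_logits = image_emb @ text_queue.t() / temperature     # (B, Q)
    txt_logits = question_emb @ text_queue.t() / temperature  # (B, Q)

    soft_targets = F.softmax(txt_logits, dim=-1)              # soft labels from the question
    log_probs = F.log_softmax(img_logits, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")


if __name__ == "__main__":
    B, Nv, Nt, Q, D = 2, 49, 16, 256, 768
    fusion = ImageTextFusion(dim=D)
    vis, txt = torch.randn(B, Nv, D), torch.randn(B, Nt, D)
    fused = fusion(vis, txt)                                  # (B, Nv, D), text-conditioned
    loss = soft_semantic_loss(fused.mean(dim=1), txt.mean(dim=1), torch.randn(Q, D))
    print(fused.shape, loss.item())
```

In this sketch the fused visual features would be passed on to the LLM for model-level interaction, while the soft semantic loss acts as the additional feature-level alignment constraint.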
Related papers
- Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine [9.881981672848598]
We introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB. It supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. Results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks.
arXiv Detail & Related papers (2024-12-12T13:41:35Z) - ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction [3.2274401541163322]
We propose a memory-efficient multimodal Transformer that can model interactions between pathway and histology patch tokens.
Our proposed model, SURVPATH, achieves state-of-the-art performance when evaluated against both unimodal and multimodal baselines.
arXiv Detail & Related papers (2023-04-13T21:02:32Z) - RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs belong to two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of the text description.
Experiments on two datasets show that MuVAM outperforms the state-of-the-art method.
arXiv Detail & Related papers (2021-07-07T13:40:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.