DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning
- URL: http://arxiv.org/abs/2504.09598v1
- Date: Sun, 13 Apr 2025 14:31:55 GMT
- Title: DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning
- Authors: Yining Zhao, Ali Braytee, Mukesh Prasad
- Abstract summary: We present DualPrompt-MedCap, a novel dual-prompt enhancement framework that augments Large Vision-Language Models (LVLMs) through two specialized components: a modality-aware prompt derived from a semi-supervised classification model pretrained on medical question-answer pairs, and a question-guided prompt leveraging biomedical language model embeddings. Our method enables the generation of clinically accurate reports that can serve as medical experts' prior knowledge and automatic annotations for downstream vision-language tasks.
- Score: 5.456249017636404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image captioning via vision-language models has shown promising potential for clinical diagnosis assistance. However, generating contextually relevant descriptions with accurate modality recognition remains challenging. We present DualPrompt-MedCap, a novel dual-prompt enhancement framework that augments Large Vision-Language Models (LVLMs) through two specialized components: (1) a modality-aware prompt derived from a semi-supervised classification model pretrained on medical question-answer pairs, and (2) a question-guided prompt leveraging biomedical language model embeddings. To address the lack of captioning ground truth, we also propose an evaluation framework that jointly considers spatial-semantic relevance and medical narrative quality. Experiments on multiple medical datasets demonstrate that DualPrompt-MedCap outperforms the baseline BLIP-3 by achieving a 22% improvement in modality recognition accuracy while generating more comprehensive and question-aligned descriptions. Our method enables the generation of clinically accurate reports that can serve as medical experts' prior knowledge and automatic annotations for downstream vision-language tasks.
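The abstract describes how the two prompts are built and then combined before the LVLM generates a caption. Below is a minimal sketch of that flow, assuming a modality classifier, a biomedical text encoder, and an LVLM caption interface supplied by the caller; all names and the template-matching step are illustrative placeholders, not the authors' released code.

```python
# Hedged sketch of the dual-prompt idea: a modality-aware prompt from a pretrained
# modality classifier plus a question-guided prompt chosen via biomedical text
# embeddings, both prepended to the LVLM's captioning instruction.
# All names below (build_*, lvlm.generate, the templates dict) are placeholders.
from dataclasses import dataclass


@dataclass
class DualPrompt:
    modality_prompt: str
    question_prompt: str

    def render(self, question: str) -> str:
        return (
            f"{self.modality_prompt} {self.question_prompt} "
            f"Question: {question} Describe the image accordingly."
        )


def build_modality_prompt(image, modality_classifier) -> str:
    """Predict the imaging modality (e.g. 'CT') and phrase it as an instruction."""
    modality = modality_classifier(image)
    return f"This is a {modality} image."


def build_question_prompt(question: str, biomed_encoder, templates: dict) -> str:
    """Pick the guidance template whose biomedical embedding best matches the question."""
    q_vec = biomed_encoder(question)
    best_key = max(templates, key=lambda key: float(q_vec @ biomed_encoder(key)))
    return templates[best_key]


def caption(image, question, lvlm, modality_classifier, biomed_encoder, templates):
    # Both prompts are produced before the LVLM is called, so the captioner only
    # receives enriched text alongside the image.
    prompt = DualPrompt(
        build_modality_prompt(image, modality_classifier),
        build_question_prompt(question, biomed_encoder, templates),
    ).render(question)
    return lvlm.generate(image=image, prompt=prompt)
```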
Related papers
- An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training [40.16314726875265]
ConceptCLIP is the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy.
We develop ConceptCLIP through a novel dual-alignment approach that simultaneously learns global image-text representations and fine-grained region-concept associations.
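As a rough illustration of such a dual-alignment objective (a sketch of the general idea, not ConceptCLIP's code), the global image-text term and the fine-grained region-concept term can be written as two contrastive losses combined with a weight:

```python
# Sketch of a dual-alignment objective: global image-text contrast plus a
# fine-grained region-concept contrast, combined with a weighting factor.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched rows of a and b, both shaped [N, D]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def dual_alignment_loss(img_global, txt_global, region_feats, concept_feats, lam=0.5):
    """Global image-text alignment plus region-concept alignment."""
    return info_nce(img_global, txt_global) + lam * info_nce(region_feats, concept_feats)
```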
arXiv Detail & Related papers (2025-01-26T16:07:11Z) - Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning [44.99833362998488]
We propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA) for medical image analysis.
HiCA combines domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels.
We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound.
arXiv Detail & Related papers (2025-01-16T05:01:30Z) - Exploring Low-Resource Medical Image Classification with Weakly Supervised Prompt Learning [21.604146757986765]
Existing pre-trained vision-language models require domain experts to carefully design the medical prompts.
We propose a weakly supervised prompt learning method MedPrompt to automatically generate medical prompts.
We show that the model using our automatically generated prompts outperforms its counterparts that use hand-crafted prompts with full-shot learning.
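A minimal sketch of the general idea of learned (rather than hand-crafted) prompts trained from weak labels is below; the module is a simplified stand-in, not the MedPrompt implementation, and the mean-pooled context is only a placeholder for a frozen text encoder.

```python
# Illustrative sketch: replace hand-written prompt words with learnable context
# vectors, optimized against weak (noisy) image labels.
import torch
import torch.nn as nn


class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int, dim: int, class_embeddings: torch.Tensor):
        super().__init__()
        # n_ctx shared context vectors are learned instead of hand-crafted words.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.register_buffer("cls", class_embeddings)  # [num_classes, dim]

    def forward(self) -> torch.Tensor:
        # Learned context combined with each class embedding; a stand-in for
        # running "context tokens + class name" through a frozen text encoder.
        ctx = self.ctx.mean(dim=0, keepdim=True)                        # [1, dim]
        return nn.functional.normalize(self.cls + ctx, dim=-1)          # [num_classes, dim]


def weak_label_loss(image_feats, prompt_module, weak_labels, temperature=0.07):
    """Cross-entropy between image-prompt similarities and the weak labels."""
    text_feats = prompt_module()                                        # [C, D]
    image_feats = nn.functional.normalize(image_feats, dim=-1)          # [N, D]
    logits = image_feats @ text_feats.t() / temperature
    return nn.functional.cross_entropy(logits, weak_labels)
```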
arXiv Detail & Related papers (2024-02-06T07:53:23Z) - Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning [12.10183458424711]
We present a novel medical image captioning method guided by the Segment Anything Model (SAM).
Our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images.
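The summary does not spell out the mechanism; one plausible reading (purely an assumption here, not the paper's code) is that SAM-style region masks are used to pool fine-grained features alongside a global feature before captioning, e.g.:

```python
# Rough sketch under an assumed mechanism: SAM-provided masks pool region
# features, which are concatenated with a global image feature.
import torch


def sam_guided_features(feature_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """
    feature_map: [C, H, W] visual features from the image encoder.
    masks:       [K, H, W] binary region masks (e.g. produced by SAM).
    Returns [K + 1, C]: one global vector plus one pooled vector per region.
    """
    global_feat = feature_map.flatten(1).mean(dim=1, keepdim=True).t()      # [1, C]
    masks = masks.float()
    denom = masks.sum(dim=(1, 2)).clamp(min=1.0)                            # [K]
    region_feats = torch.einsum("chw,khw->kc", feature_map, masks) / denom[:, None]
    return torch.cat([global_feat, region_feats], dim=0)
```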
arXiv Detail & Related papers (2023-11-02T05:44:13Z) - Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
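A minimal sketch of that pipeline: concept names (hard-coded below for illustration; the paper obtains them from GPT-4) are scored against the image with a vision-language model, and an interpretable linear head over the concept scores makes the prediction. This is not the authors' code.

```python
# Concept-bottleneck sketch: image-concept similarity scores from a VLM feed a
# simple linear classifier whose weights are readable per concept.
import torch
import torch.nn as nn
import torch.nn.functional as F

concepts = ["focal opacity", "pleural effusion", "cardiomegaly"]  # illustrative only


class ConceptBottleneckClassifier(nn.Module):
    def __init__(self, n_concepts: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(n_concepts, n_classes)  # interpretable concept weights

    def forward(self, image_emb: torch.Tensor, concept_embs: torch.Tensor) -> torch.Tensor:
        # image_emb: [B, D]; concept_embs: [C, D] from the VLM text encoder.
        scores = F.normalize(image_emb, dim=-1) @ F.normalize(concept_embs, dim=-1).t()
        return self.head(scores)  # prediction expressed through concept scores


# usage sketch: model = ConceptBottleneckClassifier(len(concepts), n_classes=5)
```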
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering [7.2486693553383805]
Current Medical-VQA models learn cross-modal representations with vision and text encoders residing in two separate spaces.
We propose UnICLAM, a Unified and Interpretable Medical-VQA model through Contrastive Representation Learning with Adversarial Masking.
Experimental results on the VQA-RAD and SLAKE public benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA models.
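As a heavily simplified illustration of contrastive learning with adversarial masking (my approximation, not UnICLAM's training procedure), one can try several candidate patch masks and train on the one that makes the contrastive objective hardest:

```python
# Simplified "adversarial masking" sketch: among random candidate masks, keep the
# one that maximizes the contrastive loss, then backpropagate through that loss.
import torch
import torch.nn.functional as F


def contrastive_loss(img, txt, temperature=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)


def adversarial_masking_step(encode_image, patches, txt_feats, n_candidates=4, mask_ratio=0.3):
    """patches: [N, P, D] patch embeddings; txt_feats: [N, D] text features."""
    worst_loss = None
    for _ in range(n_candidates):
        keep = (torch.rand(patches.shape[:2], device=patches.device) > mask_ratio).float()
        img_feats = encode_image(patches * keep.unsqueeze(-1))  # [N, D]
        loss = contrastive_loss(img_feats, txt_feats)
        if worst_loss is None or loss > worst_loss:
            worst_loss = loss
    return worst_loss  # the hardest masking drives the update
```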
arXiv Detail & Related papers (2022-12-21T02:48:15Z) - Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach that enhances medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model, enabling it to reason with knowledge as a supplement to the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
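A minimal sketch of the second step (knowledge injected into the multi-modal fusion model), under the assumption that knowledge entries are embedded as extra tokens the fusion transformer can attend to; this is an illustration, not the paper's architecture:

```python
# Knowledge-injection sketch: embedded knowledge tokens are appended to the image
# and text tokens so the fusion transformer can attend to them as extra context.
import torch
import torch.nn as nn


class KnowledgeFusion(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, img_tokens, txt_tokens, knowledge_tokens):
        # [B, Ti, D], [B, Tt, D], [B, Tk, D] -> fused sequence [B, Ti+Tt+Tk, D]
        tokens = torch.cat([img_tokens, txt_tokens, knowledge_tokens], dim=1)
        return self.fusion(tokens)
```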
arXiv Detail & Related papers (2022-09-15T08:00:01Z) - Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG).
CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure.
Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT outperforms previous benchmark methods.
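One way the triple injection described above could look (a cross-attention sketch under my own assumptions, not the CGT implementation): visual tokens attend to embedded (subject, relation, object) triples, and the enriched features condition decoding.

```python
# Sketch: inject relation-triple embeddings into visual features via cross-attention.
import torch
import torch.nn as nn


class TripleInjection(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats: torch.Tensor, triple_embs: torch.Tensor) -> torch.Tensor:
        # visual_feats: [B, Tv, D]; triple_embs: [B, Tk, D] embedded clinical triples.
        attended, _ = self.cross_attn(query=visual_feats, key=triple_embs, value=triple_embs)
        return self.norm(visual_feats + attended)  # knowledge-enriched visual features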
arXiv Detail & Related papers (2022-06-04T13:16:30Z) - Semi-Supervised Variational Reasoning for Medical Dialogue Generation [70.838542865384]
Two key characteristics are relevant for medical dialogue generation: patient states and physician actions.
We propose an end-to-end variational reasoning approach to medical dialogue generation.
A physician policy network, composed of an action classifier and two reasoning detectors, is proposed for augmented reasoning ability.
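A schematic sketch of the network structure the summary names; the state encoding and what each reasoning detector predicts are left abstract because the summary does not specify them, and this is not the released model.

```python
# Schematic sketch: one action classifier and two reasoning detectors sharing an
# encoded dialogue state.
import torch
import torch.nn as nn


class PhysicianPolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, n_labels_1: int, n_labels_2: int):
        super().__init__()
        self.action_classifier = nn.Linear(state_dim, n_actions)
        self.reasoning_detector_1 = nn.Linear(state_dim, n_labels_1)  # target unspecified in the summary
        self.reasoning_detector_2 = nn.Linear(state_dim, n_labels_2)  # target unspecified in the summary

    def forward(self, state: torch.Tensor):
        action_logits = self.action_classifier(state)
        det_1 = torch.sigmoid(self.reasoning_detector_1(state))
        det_2 = torch.sigmoid(self.reasoning_detector_2(state))
        return action_logits, det_1, det_2
```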
arXiv Detail & Related papers (2021-05-13T04:14:35Z) - MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation [86.38736781043109]
We build and release MedDG, a large-scale, high-quality medical dialogue dataset covering 12 types of common gastrointestinal diseases.
We propose two kinds of medical dialogue tasks based on the MedDG dataset: next entity prediction and doctor response generation.
Experimental results show that pre-trained language models and other baselines struggle on both tasks, with poor performance on our dataset.
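To make the two task formats concrete, here is an illustrative, made-up example; the field names and dialogue content are hypothetical and not taken from MedDG.

```python
# Hypothetical sample showing the two tasks implied by the summary: given the
# dialogue history, (1) predict the next medical entities, (2) generate the reply.
example = {
    "history": [
        "Patient: I have had stomach pain and nausea since yesterday.",
        "Doctor: Is the pain worse after eating?",
        "Patient: Yes, especially after greasy food.",
    ],
    # Task 1: next entity prediction -- entities expected in the doctor's reply.
    "next_entities": ["gastritis", "antacid"],
    # Task 2: doctor response generation -- the reference reply to generate.
    "response": "It sounds like possible gastritis; I would suggest an antacid "
                "and avoiding greasy food for now.",
}
```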
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.