ClinKD: Cross-Modal Clinic Knowledge Distiller For Multi-Task Medical Images
- URL: http://arxiv.org/abs/2502.05928v1
- Date: Sun, 09 Feb 2025 15:08:10 GMT
- Title: ClinKD: Cross-Modal Clinic Knowledge Distiller For Multi-Task Medical Images
- Authors: Hongyu Ge, Longkun Hao, Zihui Xu, Zhenxin Lin, Bin Li, Shoujun Zhou, Hongjin Zhao, Yihang Liu,
- Abstract summary: Med-VQA (Medical Visual Question Answering) is a crucial subtask within the broader VQA (Visual Question Answering) domain.
We introduce the ClinKD model, which incorporates modifications to model position encoding and a diversified training process.
We achieve a new state-of-the-art performance on the Med-GRIT-270k dataset.
- Score: 4.353855760968461
- License:
- Abstract: Med-VQA (Medical Visual Question Answering) is a crucial subtask within the broader VQA (Visual Question Answering) domain. This task requires a visual question answering system to analyze the provided image and corresponding question,offering reasonable analysis and suggestions to assist medical professionals in making pathological diagnoses, or ideally, enabling the system to independently provide correct diagnoses. Furthermore, more advanced Med-VQA tasks involve Referring and Grounding, which not only require the system to accurately comprehend medical images but also to pinpoint specific biological locations within those images. While many large pre-trained models have demonstrated substantial VQA capabilities,challenges persist in the medical imaging domain. The intricacy of biological features in medical images and the scarcity of high-quality medical image datasets, combined with the fact that current models are not tailored for the medical field in terms of architecture and training paradigms, hinder the full exploitation of model generalization. This results in issues such as hallucination in Visual Grounding. In this paper, we introduce the ClinKD model, which incorporates modifications to model position encoding and a diversified training process. Initially, we enhance the model's ability to perceive image and modality variations by using Med-CLIP Guided Rotary Position Embedding. Subsequently, we leverage distillation to provide prior knowledge to the model before using complete training data. Additionally, the feedback-based training process during the formal training phase further enhances data utilization. Notably, under unchanged evaluation protocols, we achieve a new state-of-the-art performance on the Med-GRIT-270k dataset, and the Med-CLIP Guided Rotary Position Embedding approach presents potential for generalizing to universal model position encoding.
Related papers
- Vision Foundation Models in Medical Image Analysis: Advances and Challenges [7.224426395050136]
Vision Foundation Models (VFMs) have sparked significant advances in the field of medical image analysis.
This paper reviews the state-of-the-art research on the adaptation of VFMs to medical image segmentation.
We discuss the latest developments in adapter-based improvements, knowledge distillation techniques, and multi-scale contextual feature modeling.
arXiv Detail & Related papers (2025-02-20T14:13:46Z) - Towards a vision foundation model for comprehensive assessment of Cardiac MRI [11.838157772803282]
We introduce a vision foundation model trained for cardiac magnetic resonance imaging (CMR) assessment.
We finetune the model in supervised way for 9 clinical tasks typical to a CMR workflow.
We demonstrate improved accuracy and robustness across all tasks, over a range of available labeled dataset sizes.
arXiv Detail & Related papers (2024-10-02T15:32:01Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Masked Vision and Language Pre-training with Unimodal and Multimodal
Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - A Trustworthy Framework for Medical Image Analysis with Deep Learning [71.48204494889505]
TRUDLMIA is a trustworthy deep learning framework for medical image analysis.
It is anticipated that the framework will support researchers and clinicians in advancing the use of deep learning for dealing with public health crises including COVID-19.
arXiv Detail & Related papers (2022-12-06T05:30:22Z) - Medical Image Understanding with Pretrained Vision Language Models: A
Comprehensive Study [8.547751745702156]
We show that well-designed medical prompts are the key to elicit knowledge from pre-trained vision language models (VLM)
We develop three approaches for automatic generation of medical prompts, which can inject expert-level medical knowledge and image-specific information into the prompts for fine-grained grounding.
arXiv Detail & Related papers (2022-09-30T15:06:13Z) - Domain Generalization on Medical Imaging Classification using Episodic
Training with Task Augmentation [62.49837463676111]
We propose a novel scheme of episodic training with task augmentation on medical imaging classification.
Motivated by the limited number of source domains in real-world medical deployment, we consider the unique task-level overfitting.
arXiv Detail & Related papers (2021-06-13T03:56:59Z) - Explaining Clinical Decision Support Systems in Medical Imaging using
Cycle-Consistent Activation Maximization [112.2628296775395]
Clinical decision support using deep neural networks has become a topic of steadily growing interest.
clinicians are often hesitant to adopt the technology because its underlying decision-making process is considered to be intransparent and difficult to comprehend.
We propose a novel decision explanation scheme based on CycleGAN activation which generates high-quality visualizations of classifier decisions even in smaller data sets.
arXiv Detail & Related papers (2020-10-09T14:39:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.