Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning
for Medical Image Captioning
- URL: http://arxiv.org/abs/2311.01004v2
- Date: Sat, 30 Dec 2023 17:17:58 GMT
- Title: Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning
for Medical Image Captioning
- Authors: Zhenyu Zhang, Benlu Wang, Weijie Liang, Yizhi Li, Xuechen Guo,
Guanhong Wang, Shiyan Li, Gaoang Wang
- Abstract summary: We present a novel medical image captioning method guided by the Segment Anything Model (SAM).
Our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images.
- Score: 12.10183458424711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of multimodal and large language models, deep
learning-based techniques for medical image captioning hold the potential to
offer valuable diagnostic recommendations. However, current generic pre-trained
text and image models do not yield satisfactory results when
describing intricate details within medical images. In this paper, we present a
novel medical image captioning method guided by the Segment Anything Model
(SAM) to enable enhanced encoding with both general and detailed feature
extraction. In addition, our approach employs a distinctive pre-training
strategy with mixed semantic learning to simultaneously capture both the
overall information and finer details within medical images. We demonstrate the
effectiveness of this approach, as it outperforms the pre-trained BLIP-2 model
on various evaluation metrics for generating descriptions of medical images.
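To make the encoding scheme concrete, the sketch below shows one plausible way a frozen SAM image encoder could be paired with a general-purpose vision encoder so that global semantics and fine-grained detail are fused before caption generation. All module names, shapes, and dimensions here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    """Illustrative fusion of a general vision encoder (global semantics)
    with a frozen SAM image encoder (fine-grained detail). Assumes both
    branches emit token grids of the same shape; names are hypothetical."""
    def __init__(self, general_encoder, sam_encoder, dim=768):
        super().__init__()
        self.general_encoder = general_encoder   # e.g. a BLIP-2-style ViT
        self.sam_encoder = sam_encoder           # frozen SAM image encoder
        for p in self.sam_encoder.parameters():  # keep SAM fixed
            p.requires_grad = False
        self.fuse = nn.Linear(2 * dim, dim)      # merge the two token streams

    def forward(self, images):
        g = self.general_encoder(images)         # (B, N, dim) global features
        s = self.sam_encoder(images)             # (B, N, dim) detail features
        return self.fuse(torch.cat([g, s], -1))  # (B, N, dim) enhanced tokens
```

A caption decoder (for example, the language side of a BLIP-2-style model) would then cross-attend over the fused tokens; the mixed semantic pre-training described in the abstract would supervise such an encoder with both overall and detail-oriented descriptions.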
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med's performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
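As a rough illustration of what enforcing triplet correlations could look like, the hedged sketch below applies a margin-based alignment loss to each pair among image, conversation-description, and extended-caption embeddings; the function and hyperparameters are assumptions, not the published LoGra-Med objective.

```python
import torch
import torch.nn.functional as F

def triplet_alignment_loss(img, conv, cap, margin=0.2):
    """Hypothetical margin loss tying image, conversation-description, and
    extended-caption embeddings together: each matched pair should score
    higher than the hardest in-batch negative."""
    def pair_loss(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        sim = a @ b.t()                                      # (B, B) cosine sims
        pos = sim.diag()                                     # matched pairs
        mask = torch.eye(len(a), dtype=torch.bool, device=a.device)
        neg = sim.masked_fill(mask, -1.0).max(dim=1).values  # hardest negative
        return F.relu(margin - pos + neg).mean()
    # align all three sides of the (image, conversation, caption) triplet
    return pair_loss(img, conv) + pair_loss(conv, cap) + pair_loss(img, cap)
```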
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- CoBooM: Codebook Guided Bootstrapping for Medical Image Representation Learning [6.838695126692698]
Self-supervised learning has emerged as a promising paradigm for medical image analysis by harnessing unannotated data.
Existing SSL approaches overlook the high anatomical similarity inherent in medical images.
We propose CoBooM, a novel framework for self-supervised medical image learning by integrating continuous and discrete representations.
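One plausible reading of "integrating continuous and discrete representations" is a vector-quantisation bottleneck sitting next to the continuous feature. The sketch below shows that generic mechanism with a straight-through estimator; it is an assumption about the technique, not the actual CoBooM code.

```python
import torch
import torch.nn as nn

class CodebookQuantizer(nn.Module):
    """Generic vector-quantisation branch: snaps a continuous feature to
    its nearest learned code while keeping gradients flowing through a
    straight-through estimator. Sizes are illustrative assumptions."""
    def __init__(self, dim=256, num_codes=512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                             # z: (B, dim) continuous
        dists = torch.cdist(z, self.codebook.weight)  # (B, num_codes)
        idx = dists.argmin(dim=-1)                    # nearest code index
        z_q = self.codebook(idx)                      # discrete embedding
        z_q = z + (z_q - z).detach()                  # straight-through trick
        return z_q, idx
```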
arXiv Detail & Related papers (2024-08-08T06:59:32Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
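As a hedged illustration of the local token-knowledge-patch alignment idea, the sketch below scores each report token against all image patches and encourages every token to find a well-matching patch; it is a generic alignment objective, not MLIP's published loss.

```python
import torch
import torch.nn.functional as F

def token_patch_alignment(token_emb, patch_emb):
    """Illustrative local alignment: each text token should be close to
    at least one image patch. Shapes: tokens (B, T, D), patches (B, P, D)."""
    t = F.normalize(token_emb, dim=-1)
    p = F.normalize(patch_emb, dim=-1)
    sim = torch.einsum('btd,bpd->btp', t, p)   # token-patch similarities
    best = sim.max(dim=-1).values              # best-matching patch per token
    return (1.0 - best).mean()                 # penalise unmatched tokens
```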
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- Unified Medical Image Pre-training in Language-Guided Common Semantic Space [39.61770813855078]
We propose a Unified Medical Image Pre-training framework, namely UniMedI.
UniMedI uses diagnostic reports as common semantic space to create unified representations for diverse modalities of medical images.
We evaluate its performance on both 2D and 3D images across 10 different datasets.
arXiv Detail & Related papers (2023-11-24T22:01:12Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- Domain Generalization for Mammographic Image Analysis with Contrastive Learning [62.25104935889111]
Training an efficacious deep learning model requires large amounts of data with diverse styles and qualities.
A novel contrastive learning scheme is developed to equip deep learning models with better style generalization capability.
The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
arXiv Detail & Related papers (2023-04-20T11:40:21Z)
- Learning Multi-Modal Brain Tumor Segmentation from Privileged Semi-Paired MRI Images with Curriculum Disentanglement Learning [4.43142018105102]
We present a novel two-step (intra-modality and inter-modality) curriculum disentanglement learning framework for brain tumor segmentation.
In the first step, we propose to conduct reconstruction and segmentation with augmented intra-modality style-consistent images.
In the second step, the model jointly performs reconstruction, unsupervised/supervised translation, and segmentation for both unpaired and paired inter-modality images.
arXiv Detail & Related papers (2022-08-26T16:52:43Z)
- Semantic segmentation of multispectral photoacoustic images using deep learning [53.65837038435433]
Photoacoustic imaging has the potential to revolutionise healthcare.
Clinical translation of the technology requires conversion of the high-dimensional acquired data into clinically relevant and interpretable information.
We present a deep learning-based approach to semantic segmentation of multispectral photoacoustic images.
arXiv Detail & Related papers (2021-05-20T09:33:55Z)
- Contextualized Keyword Representations for Multi-modal Retinal Image Captioning [16.553644007702808]
A traditional medical image captioning model creates a medical description based only on a single medical image input.
A new end-to-end deep multi-modal medical image captioning model is proposed.
arXiv Detail & Related papers (2021-04-26T11:08:13Z)
- Contrastive Learning of Medical Visual Representations from Paired Images and Text [38.91117443316013]
We propose ConVIRT, an unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text.
Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input.
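The bidirectional contrastive objective described here is essentially an InfoNCE loss applied in both the image-to-text and text-to-image directions. The sketch below follows that general formulation; the temperature and the equal weighting of the two directions are illustrative choices rather than ConVIRT's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb, txt_emb, tau=0.1):
    """InfoNCE in both directions over a batch of paired image/text
    embeddings; temperature and 50/50 weighting are illustrative."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                         # (B, B) similarities
    targets = torch.arange(len(img), device=img.device)  # pairs on diagonal
    loss_i2t = F.cross_entropy(logits, targets)          # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```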
arXiv Detail & Related papers (2020-10-02T02:10:18Z)
- Weakly supervised multiple instance learning histopathological tumor segmentation [51.085268272912415]
We propose a weakly supervised framework for whole-slide image segmentation.
We exploit a multiple instance learning scheme to train the models.
The proposed framework has been evaluated on multi-locations and multi-centric public data from The Cancer Genome Atlas and the PatchCamelyon dataset.
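To illustrate the multiple instance learning scheme in general terms, the sketch below aggregates per-patch scores into a single slide-level prediction via max pooling, so training needs only slide-level labels; it is a textbook MIL head, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MaxPoolMILHead(nn.Module):
    """Textbook MIL head: score every patch, keep the strongest evidence
    as the slide-level logit. The feature dimension is an assumption."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)

    def forward(self, patch_feats):              # (num_patches, feat_dim)
        scores = self.scorer(patch_feats)        # per-patch tumour evidence
        return scores.max(dim=0).values          # slide-level logit
```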
arXiv Detail & Related papers (2020-04-10T13:12:47Z)