PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification
- URL: http://arxiv.org/abs/2404.08915v2
- Date: Sat, 25 May 2024 14:31:55 GMT
- Title: PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification
- Authors: Zhenwei Wang, Qiule Sun, Bingbing Zhang, Pengfei Wang, Jianxin Zhang, Qiang Zhang
- Abstract summary: We propose a new prompting multi-modal model paradigm on medical image classification based on multi-modal foundation models, called PM2.
Besides the image modality, PM2 introduces a supplementary text input, known as a prompt, to further describe the corresponding image or concept classes.
Our PM2 significantly outperforms counterparts regardless of prompt scheme and achieves state-of-the-art performance.
- Score: 12.628447384868503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Few-shot learning has been successfully applied to medical image classification, as only very few medical examples are available for training. Given this scarcity of annotated medical images, image representations should not be derived solely from a single image modality, which is insufficient for characterizing concept classes. In this paper, we propose a new prompting multi-modal model paradigm for medical image classification based on multi-modal foundation models, called PM2. Besides the image modality, PM2 introduces a supplementary text input, known as a prompt, to further describe the corresponding image or concept classes and to facilitate few-shot learning across diverse modalities. To better explore the potential of prompt engineering, we empirically investigate five distinct prompt schemes under the new paradigm. Furthermore, linear probing in multi-modal models acts as a linear classification head that takes only the class token as input, which completely ignores the merits of the rich statistics inherent in high-level visual tokens. We therefore perform linear classification on the feature distribution of the visual tokens and on the class token simultaneously. To effectively mine these rich statistics, global covariance pooling with efficient matrix power normalization is used to aggregate the visual tokens. We then study and combine two classification heads: one is shared between the class token of the image from the vision encoder and the prompt representation encoded by the text encoder; the other performs classification on the feature distribution of the visual tokens from the vision encoder. Extensive experiments on three medical datasets show that our PM2 significantly outperforms counterparts regardless of prompt scheme and achieves state-of-the-art performance.
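The covariance-pooling head described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration of global covariance pooling with matrix power normalization (computed here by eigendecomposition; the paper uses an efficient normalization scheme), not the authors' implementation, and the token shapes and upper-triangle vectorization are assumptions for illustration.

```python
import numpy as np

def covariance_pool(tokens):
    """Second-order pooling of visual tokens; `tokens` has shape (N, d)."""
    mu = tokens.mean(axis=0, keepdims=True)
    centered = tokens - mu
    return centered.T @ centered / tokens.shape[0]  # (d, d) covariance

def matrix_power_normalize(cov, alpha=0.5, eps=1e-6):
    """Matrix power normalization; alpha=0.5 gives the matrix square root."""
    vals, vecs = np.linalg.eigh(cov)          # cov is symmetric PSD
    vals = np.clip(vals, eps, None) ** alpha  # apply the power to eigenvalues
    return (vecs * vals) @ vecs.T             # V diag(vals) V^T

def pooled_feature(tokens):
    """Vectorize the upper triangle of the normalized covariance
    as input to the second (visual-token) classification head."""
    normed = matrix_power_normalize(covariance_pool(tokens))
    iu = np.triu_indices(normed.shape[0])
    return normed[iu]
```

The pooled feature would then feed a linear classifier alongside the shared class-token/prompt head.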
Related papers
- Image Class Translation Distance: A Novel Interpretable Feature for Image Classification [0.0]
We propose a novel application of image translation networks for image classification.
We train a network to translate images between possible classes, and then quantify translation distance.
These translation distances can then be examined for clusters and trends, and can be fed directly to a simple classifier.
We demonstrate the approach on a toy 2-class scenario, apples versus oranges, and then apply it to two medical imaging tasks.
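The translation-distance idea above can be sketched as follows; a minimal illustration in which the learned image-to-image translation networks are stubbed out with plain functions, and the class names and distance metric are assumptions.

```python
import numpy as np

def translation_distances(image, translators):
    """Distance between an image and its translation into each candidate class.
    `translators` maps class name -> callable; the paper trains these as
    image-to-image translation networks, stubbed here with simple functions."""
    return {c: float(np.linalg.norm(image - t(image)))
            for c, t in translators.items()}

def classify_by_translation(image, translators):
    """An image should translate 'cheaply' into its own class,
    so predict the class with the minimum translation distance."""
    dists = translation_distances(image, translators)
    return min(dists, key=dists.get)
```

In the paper, the distances can also be examined for clusters or fed to a simple downstream classifier rather than thresholded directly.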
arXiv Detail & Related papers (2024-08-16T18:48:28Z) - Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z) - Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - Inherently Interpretable Multi-Label Classification Using Class-Specific Counterfactuals [9.485195366036292]
Interpretability is essential for machine learning algorithms in high-stakes application fields such as medical image analysis.
We propose Attri-Net, an inherently interpretable model for multi-label classification.
We find that Attri-Net produces high-quality multi-label explanations consistent with clinical knowledge.
arXiv Detail & Related papers (2023-03-01T13:32:55Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [108.48540976175457]
We propose Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) for improving cross-modal representation.
We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.
Experiments conducted on two popular benchmarks, i.e. MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2022-08-21T08:37:50Z) - Learning Discriminative Representation via Metric Learning for Imbalanced Medical Image Classification [52.94051907952536]
We propose embedding metric learning into the first stage of the two-stage framework specially to help the feature extractor learn to extract more discriminative feature representations.
Experiments mainly on three medical image datasets show that the proposed approach consistently outperforms existing one-stage and two-stage approaches.
arXiv Detail & Related papers (2022-07-14T14:57:01Z) - Deep Class-Specific Affinity-Guided Convolutional Network for Multimodal Unpaired Image Segmentation [7.021001169318551]
Multi-modal medical image segmentation plays an essential role in clinical diagnosis.
It remains challenging as the input modalities are often not well-aligned spatially.
We propose an affinity-guided fully convolutional network for multimodal image segmentation.
arXiv Detail & Related papers (2021-01-05T13:56:51Z) - Contrastive Learning of Medical Visual Representations from Paired Images and Text [38.91117443316013]
We propose ConVIRT, an unsupervised strategy to learn medical visual representations by exploiting naturally occurring descriptive paired text.
Our new method pretrains medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities; it is domain-agnostic and requires no additional expert input.
arXiv Detail & Related papers (2020-10-02T02:10:18Z)
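ConVIRT's bidirectional contrastive objective can be sketched as a symmetric InfoNCE loss over paired image and text embeddings. The following is a minimal NumPy illustration, not the authors' code; the batch shape, temperature value, and cosine-similarity logits are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project embeddings onto the unit sphere for cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE: img_emb and txt_emb are (B, d) paired embeddings,
    with row i of each forming the positive pair."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # positives on the diagonal

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Well-aligned pairs concentrate probability mass on the diagonal of the similarity matrix, driving the loss toward zero in both retrieval directions.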
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.