PRIOR: Prototype Representation Joint Learning from Medical Images and
Reports
- URL: http://arxiv.org/abs/2307.12577v3
- Date: Mon, 11 Mar 2024 10:25:58 GMT
- Title: PRIOR: Prototype Representation Joint Learning from Medical Images and
Reports
- Authors: Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, Xiaoying
Tang
- Abstract summary: We present a prototype representation learning framework incorporating both global and local alignment between medical images and reports.
In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation.
A sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features.
- Score: 19.336988866061294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive learning based vision-language joint pre-training has emerged as
a successful representation learning strategy. In this paper, we present a
prototype representation learning framework incorporating both global and local
alignment between medical images and reports. In contrast to standard global
multi-modality alignment methods, we employ a local alignment module for
fine-grained representation. Furthermore, a cross-modality conditional
reconstruction module is designed to interchange information across modalities
in the training phase by reconstructing masked images and reports. For
reconstructing long reports, a sentence-wise prototype memory bank is
constructed, enabling the network to focus on low-level localized visual and
high-level clinical linguistic features. Additionally, a non-auto-regressive
generation paradigm is proposed for reconstructing non-sequential reports.
Experimental results on five downstream tasks, including supervised
classification, zero-shot classification, image-to-text retrieval, semantic
segmentation, and object detection, show the proposed method outperforms other
state-of-the-art methods across multiple datasets and under different dataset
size settings. The code is available at https://github.com/QtacierP/PRIOR.
Related papers
- Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images [12.365801596593936]
Medical image segmentation is one of the domains where sufficient annotated data is not available.
We propose a prototype-based self-supervised one-way one-shot learning framework using pseudo-labels generated from superpixels.
We show that the proposed simple but potent framework performs at par with the state-of-the-art methods.
arXiv Detail & Related papers (2024-08-12T15:38:51Z) - Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [44.008094698200026]
FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation.
FreeDA achieves state-of-the-art performance on five datasets.
arXiv Detail & Related papers (2024-04-09T18:00:25Z) - LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on
Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z) - Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z) - Mining Contextual Information Beyond Image for Semantic Segmentation [37.783233906684444]
The paper studies the context aggregation problem in semantic image segmentation.
It proposes to mine the contextual information beyond individual images to further augment the pixel representations.
The proposed method could be effortlessly incorporated into existing segmentation frameworks.
arXiv Detail & Related papers (2021-08-26T14:34:23Z) - Region Comparison Network for Interpretable Few-shot Image
Classification [97.97902360117368]
Few-shot image classification has been proposed to effectively use only a limited number of labeled examples to train models for new classes.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
arXiv Detail & Related papers (2020-09-08T07:29:05Z) - Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state of the art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.