Visual Language Pretrained Multiple Instance Zero-Shot Transfer for
Histopathology Images
- URL: http://arxiv.org/abs/2306.07831v1
- Date: Tue, 13 Jun 2023 15:05:24 GMT
- Title: Visual Language Pretrained Multiple Instance Zero-Shot Transfer for
Histopathology Images
- Authors: Ming Y. Lu, Bowen Chen, Andrew Zhang, Drew F.K. Williamson, Richard J.
Chen, Tong Ding, Long Phi Le, Yung-Sung Chuang, Faisal Mahmood
- Abstract summary: We present MI-Zero, a framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images.
MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images.
- Score: 8.612889476601822
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contrastive visual language pretraining has emerged as a powerful method for
either training new language-aware image encoders or augmenting existing
pretrained models with zero-shot visual recognition capabilities. However,
existing works typically train on large datasets of image-text pairs and have
been designed to perform downstream tasks involving only small to medium-sized
images, neither of which is applicable to the emerging field of computational
pathology, where publicly available paired image-text datasets are limited and
each image can span up to 100,000 x 100,000 pixels. In this paper, we present
MI-Zero, a simple and intuitive framework for unleashing
the zero-shot transfer capabilities of contrastively aligned image and text
models on gigapixel histopathology whole slide images, enabling multiple
downstream diagnostic tasks to be carried out by pretrained encoders without
requiring any additional labels. MI-Zero reformulates zero-shot transfer under
the framework of multiple instance learning to overcome the computational
challenge of inference on extremely large images. We used over 550k pathology
reports and other available in-domain text corpora to pre-train our text
encoder. By effectively leveraging strong pre-trained encoders, our best model
pretrained on over 33k histopathology image-caption pairs achieves an average
median zero-shot accuracy of 70.2% across three different real-world cancer
subtyping tasks. Our code is available at:
https://github.com/mahmoodlab/MI-Zero.
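As a rough illustration of the multiple instance learning formulation, the sketch below scores patch embeddings from a pre-aligned image encoder against class-prompt embeddings from the text encoder and pools the instance-level scores into a slide-level prediction. The function, the top-K pooling choice, and the toy inputs are illustrative assumptions; the authors' implementation is available at the repository linked above.

```python
# Minimal sketch of MIL-style zero-shot transfer on one whole slide image.
# Top-K mean pooling is only one of several possible aggregation operators;
# the encoders are assumed to be contrastively aligned as in the abstract.
import torch
import torch.nn.functional as F

def zero_shot_slide_prediction(patch_embs: torch.Tensor,
                               class_text_embs: torch.Tensor,
                               topk: int = 5) -> int:
    """patch_embs: (N, D) embeddings of the N patches tiled from a slide.
    class_text_embs: (C, D) embeddings of one text prompt per class.
    Returns the index of the predicted class."""
    patch_embs = F.normalize(patch_embs, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sim = patch_embs @ class_text_embs.t()                # (N, C) cosine scores
    k = min(topk, sim.shape[0])
    slide_scores = sim.topk(k, dim=0).values.mean(dim=0)  # pool instances -> (C,)
    return int(slide_scores.argmax())

# Toy usage with random tensors standing in for real encoder outputs.
prediction = zero_shot_slide_prediction(torch.randn(1000, 512), torch.randn(3, 512))
```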
Related papers
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical
Image Pre-training [51.16994853817024] (2023-07-31T17:59:42Z)
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
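A hedged sketch of the disruption step described above follows, combining local masking with a simple low-level perturbation (additive noise as a stand-in); the actual framework's disruption types, masking scheme, and architecture are not reproduced.

```python
# Rough sketch of the corrupt-and-reconstruct idea: local masking plus a
# low-level perturbation (additive Gaussian noise as a stand-in) produces the
# disrupted volume that a model is then trained to invert.
import torch
import torch.nn.functional as F

def disrupt(volume: torch.Tensor, mask_ratio: float = 0.4,
            patch: int = 8, noise_std: float = 0.1) -> torch.Tensor:
    """volume: (C, D, H, W) 3D radiology image. Returns a corrupted copy."""
    corrupted = volume + noise_std * torch.randn_like(volume)  # low-level perturbation
    d, h, w = volume.shape[1:]
    for z in range(0, d, patch):                               # local masking
        for y in range(0, h, patch):
            for x in range(0, w, patch):
                if torch.rand(()) < mask_ratio:
                    corrupted[:, z:z + patch, y:y + patch, x:x + patch] = 0.0
    return corrupted

# Pre-training step for some reconstruction `model` (hypothetical):
# loss = F.mse_loss(model(disrupt(volume)), volume)
```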
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
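A minimal sketch of the quantization idea behind this alignment, under the assumption of a frozen language-model embedding table and omitting the encoder and reconstruction objective of the actual method:

```python
# Simplified sketch of the quantization step: snap each image-patch feature to
# its nearest entry in a frozen language model's token-embedding table, so the
# image becomes a sequence of ordinary text tokens.
import torch

def quantize_to_text_tokens(patch_feats: torch.Tensor,
                            token_embeddings: torch.Tensor) -> torch.Tensor:
    """patch_feats: (N, D) image features; token_embeddings: (V, D) frozen
    LM embedding table. Returns (N,) token ids."""
    dists = torch.cdist(patch_feats, token_embeddings)  # (N, V) pairwise distances
    return dists.argmin(dim=-1)                         # nearest text token per patch
```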
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller number of image-text pairs.
Our model has unique properties; most notably, a new version with updated training samples can be deployed in a matter of seconds.
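A hedged sketch of how such a training-free common space can be assembled is shown below; the anchor selection, normalization, and sparsification used by the actual method are simplified away.

```python
# Sketch of a training-free common space built from "relative" representations:
# every image or caption is described by its similarities to a small anchor set
# of paired images and captions, and those vectors are compared across modalities.
import torch
import torch.nn.functional as F

def relative(query: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """query: (D,), anchors: (M, D) -> (M,) cosine similarities to the anchors."""
    return F.normalize(anchors, dim=-1) @ F.normalize(query, dim=0)

def image_text_score(img_emb, txt_emb, anchor_img_embs, anchor_txt_embs):
    # Both relative vectors live in the same M-dimensional anchor space even
    # though the image and text encoders were never trained together.
    r_img = F.normalize(relative(img_emb, anchor_img_embs), dim=0)
    r_txt = F.normalize(relative(txt_emb, anchor_txt_embs), dim=0)
    return float(r_img @ r_txt)
```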
arXiv Detail & Related papers (2022-10-04T16:56:22Z) - PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model
Pretraining [68.84339672878066]
We introduce PyramidCLIP, which constructs an input pyramid with different semantic levels and aligns visual and linguistic elements hierarchically.
Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP.
arXiv Detail & Related papers (2022-04-29T13:38:42Z) - Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer, i.e., the network trained on the corrupted input, can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z) - Data Efficient Language-supervised Zero-shot Recognition with Optimal
Transport Distillation [43.03533959429743]
We propose OTTER, which uses online optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs.
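For intuition, the following is a simplified sketch of turning an in-batch image-text similarity matrix into soft matching targets with a few Sinkhorn iterations; the regularization strength, iteration count, and how such targets enter the contrastive loss are illustrative assumptions rather than the method's exact recipe.

```python
# Sketch: entropic optimal transport (Sinkhorn-Knopp) over in-batch similarities
# yields a soft image-text matching usable as labels for contrastive learning.
import torch

def sinkhorn_soft_targets(sim: torch.Tensor, eps: float = 0.05,
                          n_iters: int = 3) -> torch.Tensor:
    """sim: (B, B) in-batch image-text similarities.
    Returns a (B, B) transport plan, row-normalized for use as soft labels."""
    B = sim.shape[0]
    K = torch.exp(sim / eps)                 # Gibbs kernel of the similarities
    a = torch.full((B,), 1.0 / B)            # uniform row marginals
    b = torch.full((B,), 1.0 / B)            # uniform column marginals
    v = torch.ones(B)
    for _ in range(n_iters):                 # alternating Sinkhorn scalings
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = u[:, None] * K * v[None, :]       # diag(u) @ K @ diag(v)
    return plan / plan.sum(dim=1, keepdim=True)
```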
arXiv Detail & Related papers (2021-12-17T11:27:26Z) - Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of learning a fixed set of weights from images and discrete labels, treated as visual concepts, to aligning images and raw text with two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z) - Data-Efficient Language-Supervised Zero-Shot Learning with
Self-Distillation [23.631184498984933]
Natural language has been shown to be a broader and richer source of supervision than supervised "gold" labels.
We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image-text pairs, 133x smaller than CLIP's training set.
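A hedged sketch of contrastive distillation with soft labels is given below, where a pretrained teacher's in-batch image-text similarity distribution supervises the student in place of hard one-to-one matches; the temperature and the mixing of soft and hard targets are placeholder assumptions.

```python
# Sketch: the teacher's softened similarity distribution over the batch acts as
# the target for the student's image-to-text logits.
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 teacher_temp: float = 0.1) -> torch.Tensor:
    """student_logits, teacher_logits: (B, B) image-to-text similarity logits
    over the same batch; the teacher comes from frozen pretrained encoders."""
    soft_targets = F.softmax(teacher_logits / teacher_temp, dim=-1)
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    soft_targets, reduction="batchmean")
```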
arXiv Detail & Related papers (2021-04-18T19:55:31Z) - Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays [10.398175542736285]
We introduce an image-text pre-training framework that can learn from mixed data inputs.
We demonstrate the feasibility of pre-training across mixed data inputs.
We also illustrate the benefits of adopting such pre-trained models in 3 chest X-ray applications.
arXiv Detail & Related papers (2021-03-30T01:48:46Z) - Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
State-of-the-art image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
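The pre-training task of predicting which caption goes with which image is typically realized as a symmetric contrastive loss; a minimal sketch follows, with the temperature value and the encoders left as placeholders.

```python
# Sketch of the symmetric in-batch contrastive objective: each image should
# match its own caption and vice versa.
import torch
import torch.nn.functional as F

def contrastive_caption_loss(img_embs: torch.Tensor, txt_embs: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """img_embs, txt_embs: (B, D) paired embeddings from the two encoders."""
    img_embs = F.normalize(img_embs, dim=-1)
    txt_embs = F.normalize(txt_embs, dim=-1)
    logits = img_embs @ txt_embs.t() / temperature      # (B, B)
    targets = torch.arange(logits.shape[0])             # i-th image <-> i-th text
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```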
arXiv Detail & Related papers (2021-02-26T19:04:58Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)