What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for
Pathological Image Captioning
- URL: http://arxiv.org/abs/2310.20607v1
- Date: Tue, 31 Oct 2023 16:43:03 GMT
- Title: What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for
Pathological Image Captioning
- Authors: Wenkang Qin, Rui Xu, Peixiang Huang, Xiaomin Wu, Heyu Zhang and Lin
Luo
- Abstract summary: We propose a new paradigm Subtype-guided Masked Transformer (SGMT) for pathological captioning based on Transformers.
An accompanying subtype prediction is introduced into SGMT to guide the training process and enhance the captioning accuracy.
Experiments on the PatchGastricADC22 dataset demonstrate that our approach effectively adapts to the task with a transformer-based model.
- Score: 6.496515352848627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pathological captioning of Whole Slide Images (WSIs), though essential in
computer-aided pathological diagnosis, has rarely been studied due to
limitations in datasets and model training efficacy. In this paper, we propose
a new paradigm Subtype-guided Masked Transformer (SGMT) for pathological
captioning based on Transformers, which treats a WSI as a sequence of sparse
patches and generates an overall caption sentence from the sequence. An
accompanying subtype prediction is introduced into SGMT to guide the training
process and enhance the captioning accuracy. We also present an Asymmetric
Masked Mechanism to tackle the large-size constraint of pathological
image captioning, in which the number of patches in the SGMT input sequence is
sampled differently in the training and inference phases. Experiments on
the PatchGastricADC22 dataset demonstrate that our approach effectively adapts
to the task with a transformer-based model and outperforms traditional
RNN-based methods. Our code will be made available for
further research and development.
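The core idea of treating a WSI as a sequence of sparse patches, with the sequence length sampled asymmetrically between training and inference, can be sketched as follows. This is a minimal illustration under assumed names and shapes (pre-extracted patch embeddings, zero-cost random sampling), not the authors' implementation.

```python
import numpy as np

def sample_patch_sequence(patch_embeddings, n_samples, rng):
    """Randomly sample a fixed number of patch embeddings from a WSI.

    Sketch of the asymmetric sampling idea: the Transformer sees a
    shorter sequence during training than at inference, which keeps
    training tractable on gigapixel slides.
    """
    n_patches = patch_embeddings.shape[0]
    idx = rng.permutation(n_patches)[: min(n_samples, n_patches)]
    return patch_embeddings[idx]

rng = np.random.default_rng(0)
# A WSI represented as 10,000 pre-extracted patch embeddings of dimension 512.
wsi = rng.standard_normal((10_000, 512))

train_seq = sample_patch_sequence(wsi, 256, rng)   # short sequence for training
infer_seq = sample_patch_sequence(wsi, 2048, rng)  # longer sequence at inference
```

Both sequences feed the same Transformer; only the sampling budget differs between the two phases.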
Related papers
- Prompt-Guided Adaptive Model Transformation for Whole Slide Image Classification [27.21493446754789]
Multiple instance learning (MIL) has emerged as a popular method for classifying histopathology whole slide images (WSIs)
We propose Prompt-guided Adaptive Model Transformation framework that seamlessly adapts pre-trained models to the specific characteristics of histopathology data.
We rigorously evaluate our approach on two datasets, Camelyon16 and TCGA-NSCLC, showcasing substantial improvements across various MIL models.
arXiv Detail & Related papers (2024-03-19T08:23:12Z)
- PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning [35.24716774767677]
We present PathM3, a multi-task, multiple instance learning framework for WSI classification and captioning.
Our method overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data.
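A standard building block in MIL frameworks like this is an attention-based pooling of patch features into a slide-level embedding that multiple task heads (classification, captioning) can share. The sketch below follows the common attention-MIL formulation with illustrative parameter names; it is not PathM3's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(instances, v, w):
    """Attention-MIL pooling: score each instance, softmax over the bag,
    and return the weighted average as a slide-level embedding.
    Shapes: instances (n, d), v (d, h), w (h,)."""
    scores = np.tanh(instances @ v) @ w  # (n,) one score per patch
    weights = softmax(scores)            # attention distribution over the bag
    return weights @ instances           # (d,) slide-level embedding

rng = np.random.default_rng(0)
bag = rng.standard_normal((300, 128))  # 300 patch features for one slide
v = rng.standard_normal((128, 64))
w = rng.standard_normal(64)
slide_embedding = attention_mil_pool(bag, v, w)
```

The resulting fixed-size embedding is what makes slide-level supervision (a single label or caption per WSI) usable despite the variable number of patches.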
arXiv Detail & Related papers (2024-03-13T21:19:12Z)
- A self-supervised framework for learning whole slide representations [52.774822784847565]
We present Slide Pre-trained Transformers (SPT) for gigapixel-scale self-supervision of whole slide images.
We benchmark SPT visual representations on five diagnostic tasks across three biomedical microscopy datasets.
arXiv Detail & Related papers (2024-02-09T05:05:28Z)
- Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT [1.0819408603463427]
We show that using an existing pre-trained Vision Transformer (ViT) to encode 4096x4096 sized patches of the Whole Slide Image (WSI) and a pre-trained Bidirectional Representations from Transformers (BERT) model for report generation, we can build a performant and portable report generation mechanism.
Our method not only generates and evaluates captions that describe the image, but also classifies the image by tissue type and patient gender.
arXiv Detail & Related papers (2023-12-03T15:56:09Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Masked Pre-Training of Transformers for Histology Image Analysis [4.710921988115685]
In digital pathology, whole slide images (WSIs) are widely used for applications such as cancer diagnosis and prognosis prediction.
Visual transformer models have emerged as a promising method for encoding large regions of WSIs while preserving spatial relationships among patches.
We propose a pretext task for training the transformer model without labeled data to address this problem.
Our model, MaskHIT, uses the transformer output to reconstruct masked patches and learn representative histological features based on their positions and visual features.
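The masked pre-training idea, reconstructing hidden patches from their visible context, can be sketched as below. The zero mask token and all names are simplifying assumptions for illustration, not MaskHIT's implementation (which uses a learned mask token and a transformer to produce the predictions).

```python
import numpy as np

def mask_patch_sequence(patches, mask_ratio, rng):
    """Replace a random subset of patch embeddings with a mask token so a
    model can be trained to reconstruct them from the remaining context.
    Returns the masked sequence and the boolean mask."""
    n, d = patches.shape
    n_masked = int(n * mask_ratio)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    masked = patches.copy()
    masked[mask] = 0.0  # a learned mask token in practice
    return masked, mask

rng = np.random.default_rng(0)
patches = rng.standard_normal((200, 64))
masked, mask = mask_patch_sequence(patches, 0.5, rng)

# The reconstruction loss is computed only at the masked positions;
# here a zero prediction stands in for the transformer's output.
prediction = np.zeros_like(patches)
loss = np.mean((prediction[mask] - patches[mask]) ** 2)
```

No labels are needed: the original patch embeddings themselves serve as the reconstruction targets, which is what makes this a pretext task.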
arXiv Detail & Related papers (2023-04-14T23:56:49Z)
- Language models are good pathologists: using attention-based sequence reduction and text-pretrained transformers for efficient WSI classification [0.21756081703275998]
Whole Slide Image (WSI) analysis is usually formulated as a Multiple Instance Learning (MIL) problem.
We introduce SeqShort, a sequence-shortening layer that summarizes each WSI in a fixed-size, short sequence of instances.
We show that WSI classification performance can be improved when the downstream transformer architecture has been pre-trained on a large corpus of text data.
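Attention-based sequence reduction of this kind is commonly realized as cross-attention pooling: a small fixed set of learned queries attends over the variable-length instance sequence and emits a fixed-size summary. The sketch below shows that pattern under assumed shapes; it is not the SeqShort layer itself.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def shorten_sequence(instances, queries):
    """Cross-attention pooling: k learned query vectors attend over the
    n-instance WSI sequence and return a fixed (k, d) summary, however
    large n is."""
    d = queries.shape[1]
    attn = softmax_rows(queries @ instances.T / np.sqrt(d))  # (k, n)
    return attn @ instances                                  # (k, d)

rng = np.random.default_rng(0)
wsi = rng.standard_normal((5_000, 128))    # long, slide-dependent length
queries = rng.standard_normal((16, 128))   # k = 16 learned queries
short = shorten_sequence(wsi, queries)     # fixed size regardless of n
```

The downstream transformer then runs on 16 tokens instead of thousands, which is where the efficiency gain comes from.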
arXiv Detail & Related papers (2022-11-14T14:11:31Z)
- Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis [68.1281982092765]
We propose a novel normalization module, termed as REtrieval-based Spatially AdaptIve normaLization (RESAIL)
RESAIL provides pixel level fine-grained guidance to the normalization architecture.
Experiments on several challenging datasets show that RESAIL performs favorably against state-of-the-art methods in terms of quantitative metrics, visual quality, and subjective evaluation.
arXiv Detail & Related papers (2022-04-06T14:21:39Z)
- A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection [93.38607559281601]
We devise a hierarchical generative model that captures the multi-scale patch distribution of each training image.
The anomaly score is obtained by aggregating the patch-based votes of the correct transformation across scales and image regions.
arXiv Detail & Related papers (2021-04-29T17:49:48Z)
- Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework.
Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.