Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image
Classification Using Transformers
- URL: http://arxiv.org/abs/2305.06963v1
- Date: Thu, 11 May 2023 16:42:24 GMT
- Title: Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image
Classification Using Transformers
- Authors: Firas Khader, Jakob Nikolas Kather, Tianyu Han, Sven Nebelung,
Christiane Kuhl, Johannes Stegmaier, Daniel Truhn
- Abstract summary: Whole-Slide Imaging allows for the capturing and digitization of high-resolution images of histological specimens.
The transformer architecture has been proposed as a possible candidate for effectively leveraging the high-resolution information.
We propose a novel cascaded cross-attention network (CCAN) based on the cross-attention mechanism that scales linearly with the number of extracted patches.
- Score: 0.11219061154635457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whole-Slide Imaging allows for the capturing and digitization of
high-resolution images of histological specimens. An automated analysis of such
images using deep learning models is therefore in high demand. The transformer
architecture has been proposed as a possible candidate for effectively
leveraging the high-resolution information. Here, the whole-slide image is
partitioned into smaller image patches and feature tokens are extracted from
these image patches. However, while the conventional transformer allows for a
simultaneous processing of a large set of input tokens, the computational
demand scales quadratically with the number of input tokens and thus
quadratically with the number of image patches. To address this problem we
propose a novel cascaded cross-attention network (CCAN) based on the
cross-attention mechanism that scales linearly with the number of extracted
patches. Our experiments demonstrate that this architecture performs at least
on par with, and in some cases outperforms, other attention-based
state-of-the-art methods on two public datasets: on the use case of lung cancer
(TCGA NSCLC), our model reaches a mean area under the receiver operating
characteristic curve (AUC) of 0.970 $\pm$ 0.008, and on renal cancer (TCGA RCC),
it reaches a mean AUC of 0.985 $\pm$ 0.004.
Furthermore, we show that our proposed model is efficient in low-data regimes,
making it a promising approach for analyzing whole-slide images in
resource-limited settings. To foster research in this direction, we make our
code publicly available on GitHub: XXX.
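The abstract describes cross-attention that scales linearly with the number of patches. A common way to obtain that behavior (as in Perceiver-style models) is to attend from a small, fixed set of learned latent tokens to the patch tokens, so the cost is O(L*N) rather than O(N^2). A minimal PyTorch sketch; class and parameter names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Cross-attention from a fixed set of learned latent tokens to N patch
    tokens. Cost is O(num_latents * N), i.e. linear in N, unlike the
    O(N^2) cost of self-attention over the patches themselves."""

    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens):            # (B, N, dim), N can be huge
        B = patch_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)    # (B, L, dim)
        out, _ = self.attn(q, patch_tokens, patch_tokens)  # (B, L, dim)
        return out                               # fixed-size slide summary

# e.g. 10,000 patch embeddings extracted from one whole-slide image
tokens = torch.randn(1, 10_000, 256)
summary = LatentCrossAttention()(tokens)         # -> (1, 64, 256)
```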
Related papers
- Adaptive Patching for High-resolution Image Segmentation with Transformers [9.525013089622183]
Attention-based models are proliferating in the space of image analytics, including segmentation.
The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens.
For high-resolution images, e.g. microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model if we are to use the smaller patch sizes that are favorable in segmentation.
We take inspiration from Adaptive Mesh Refinement (AMR) methods in HPC by adaptively patching the images as a pre-processing step.
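A minimal sketch of content-adaptive patching in that spirit, assuming a simple pixel-variance split criterion (the summary above does not state the paper's exact refinement rule):

```python
import numpy as np

def adaptive_patches(img, y=0, x=0, size=None, thresh=0.01, min_size=16):
    """Quadtree-style patching: recursively split regions whose pixel
    variance exceeds `thresh`, so detailed areas get small patches and
    homogeneous background gets large ones (fewer tokens overall)."""
    if size is None:
        size = img.shape[0]                 # assume a square input image
    block = img[y:y + size, x:x + size]
    if size <= min_size or block.var() <= thresh:
        return [(y, x, size)]               # keep this region as one patch
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches += adaptive_patches(img, y + dy, x + dx, half,
                                        thresh, min_size)
    return patches

img = np.zeros((256, 256))
img[:64, :64] = np.random.rand(64, 64)      # detail only in one corner
print(len(adaptive_patches(img)))           # a few large + some small patches
```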
arXiv Detail & Related papers (2024-04-15T12:06:00Z)
- Mixing Histopathology Prototypes into Robust Slide-Level Representations for Cancer Subtyping [19.577541771516124]
Whole-slide image analysis by means of computational pathology often relies on processing tessellated gigapixel images with only slide-level labels available.
Applying multiple instance learning-based methods or transformer models is computationally expensive, as all instances of each image have to be processed simultaneously.
The Mixer is an under-explored alternative model to common vision transformers, especially for large-scale datasets.
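A single Mixer block replaces attention with two MLPs, one mixing information across tokens and one across channels, avoiding the quadratic attention term. A minimal PyTorch sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: an MLP applied across the token dimension
    (token mixing), then an MLP applied per token across channels."""

    def __init__(self, num_tokens, dim, hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(),
            nn.Linear(hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                        # (B, num_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

x = torch.randn(2, 100, 256)                     # e.g. 100 prototype tokens
print(MixerBlock(100, 256)(x).shape)             # torch.Size([2, 100, 256])
```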
arXiv Detail & Related papers (2023-10-19T14:15:20Z)
- Pixel-Inconsistency Modeling for Image Manipulation Localization [59.968362815126326]
Digital image forensics plays a crucial role in image authentication and manipulation localization.
This paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts.
Experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints.
arXiv Detail & Related papers (2023-09-30T02:54:51Z)
- Super-Resolution of License Plate Images Using Attention Modules and Sub-Pixel Convolution Layers [3.8831062015253055]
We introduce a Single-Image Super-Resolution (SISR) approach to enhance the detection of structural and textural features in surveillance images.
Our approach incorporates sub-pixel convolution layers and a loss function that uses an Optical Character Recognition (OCR) model for feature extraction.
Our results show that our approach for reconstructing low-resolution synthesized images outperforms existing approaches in both quantitative and qualitative measures.
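Sub-pixel convolution is the standard PixelShuffle operation: a convolution produces r^2 times the target channels, which are then rearranged into an r-times larger map. A minimal sketch; the plate size and upscaling factor are illustrative:

```python
import torch
import torch.nn as nn

r = 4                                            # upscaling factor
upsample = nn.Sequential(
    # Produce r^2 output maps per target channel...
    nn.Conv2d(3, 3 * r * r, kernel_size=3, padding=1),
    # ...then rearrange them: (B, 3r^2, H, W) -> (B, 3, rH, rW)
    nn.PixelShuffle(r),
)
lr_plate = torch.randn(1, 3, 16, 48)             # low-resolution plate crop
print(upsample(lr_plate).shape)                  # torch.Size([1, 3, 64, 192])
```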
arXiv Detail & Related papers (2023-05-27T00:17:19Z)
- Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning [55.762840052788945]
We present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales.
We find that tasking the network with reconstructing both low/high frequency images leads to robust multiscale representations for remote sensing imagery.
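As a rough illustration of dual-frequency reconstruction targets, an image can be split into a downsampled low-frequency component and a high-frequency residual; this is our simplification, not Scale-MAE's exact decoder:

```python
import torch
import torch.nn.functional as F

def frequency_targets(img, scale=4):
    """Split an image into a low-frequency target (downsampled) and a
    high-frequency residual, as simple stand-ins for the two
    reconstruction targets described in the abstract."""
    low = F.interpolate(img, scale_factor=1 / scale, mode='bilinear',
                        align_corners=False)
    low_up = F.interpolate(low, size=img.shape[-2:], mode='bilinear',
                           align_corners=False)
    high = img - low_up                    # the detail the blur removed
    return low, high

img = torch.randn(1, 3, 224, 224)
low, high = frequency_targets(img)
print(low.shape, high.shape)               # (1, 3, 56, 56) (1, 3, 224, 224)
```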
arXiv Detail & Related papers (2022-12-30T03:15:34Z)
- Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
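For orientation, a conditional diffusion decoder draws an image from noise while conditioning each denoising step on the compressed latent. A minimal DDPM-style sampling loop, with `eps_net` standing in for a trained noise-prediction network (this is a generic sketch, not the paper's implementation):

```python
import torch

def diffusion_decode(eps_net, z, steps=1000, shape=(1, 3, 64, 64)):
    """Reverse diffusion: start from Gaussian noise and denoise step by
    step, conditioning the noise prediction on the compressed latent z."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        eps = eps_net(x, torch.tensor([t]), z)        # predicted noise
        x = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                     # add noise except at t=0
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

# Usage with a dummy predictor (a trained conditional UNet would go here):
img = diffusion_decode(lambda x, t, z: torch.zeros_like(x), z=None, steps=50)
```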
arXiv Detail & Related papers (2022-09-14T21:53:27Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
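One simple way to realize content-based key/value aggregation is plain k-means over the token embeddings; this is an illustrative proxy, not ClusTR's exact clustering scheme:

```python
import torch

def cluster_tokens(tokens, k=64, iters=10):
    """k-means over token embeddings: the k centroids stand in for the
    full key/value set, shrinking attention cost from O(N^2) to O(N*k)."""
    B, N, D = tokens.shape
    idx = torch.randperm(N)[:k]
    centroids = tokens[:, idx, :].clone()                  # (B, k, D)
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(-1)  # (B, N)
        for b in range(B):
            for c in range(k):
                mask = assign[b] == c
                if mask.any():                             # update centroid
                    centroids[b, c] = tokens[b, mask].mean(0)
    return centroids

tokens = torch.randn(1, 4096, 128)
kv = cluster_tokens(tokens)            # (1, 64, 128): clustered keys/values
```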
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform [58.60004238261117]
We propose a versatile deep image compression network based on the Spatial Feature Transform (SFT, arXiv:1804.02815).
Our model covers a wide range of compression rates using a single model, which is controlled by arbitrary pixel-wise quality maps.
The proposed framework allows us to perform task-aware image compressions for various tasks.
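An SFT layer predicts per-location scale and shift parameters from the quality map and modulates the features with them; a minimal sketch with illustrative channel sizes:

```python
import torch
import torch.nn as nn

class SFT(nn.Module):
    """Spatial Feature Transform: a pixel-wise quality map is mapped to
    per-location scale (gamma) and shift (beta) parameters that modulate
    the features, so one network covers many rate/quality trade-offs."""

    def __init__(self, cond_ch=1, feat_ch=64):
        super().__init__()
        self.to_gamma = nn.Conv2d(cond_ch, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(cond_ch, feat_ch, 3, padding=1)

    def forward(self, feat, quality_map):      # (B,C,H,W), (B,1,H,W)
        return feat * self.to_gamma(quality_map) + self.to_beta(quality_map)

feat = torch.randn(1, 64, 32, 32)
qmap = torch.rand(1, 1, 32, 32)                # arbitrary pixel-wise quality
print(SFT()(feat, qmap).shape)                 # torch.Size([1, 64, 32, 32])
```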
arXiv Detail & Related papers (2021-08-21T17:30:06Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by allowing depth, width, resolution, and patch size to be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
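The pooling step itself is simple: treat the token sequence as a 1D signal and pool along the sequence axis between transformer stages, shrinking the attention cost. A loose sketch of that idea (not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Hierarchical pooling of visual tokens: max-pool along the sequence axis,
# roughly halving the token count (and attention cost) at each stage.
pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

tokens = torch.randn(2, 196, 384)              # (B, seq_len, dim)
pooled = pool(tokens.transpose(1, 2)).transpose(1, 2)
print(pooled.shape)                            # torch.Size([2, 98, 384])
```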
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
- Generating Images with Sparse Representations [21.27273495926409]
High dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models.
We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks.
We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences.
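A minimal version of the JPEG-style preprocessing, assuming a uniform quantization step (the paper's exact quantization scheme may differ):

```python
import numpy as np
from scipy.fft import dctn

def to_dct_blocks(img, block=8, q=16):
    """JPEG-style tokenization: split an image into 8x8 blocks, apply a
    2D DCT to each, and quantize, giving a discrete sequence that a
    Transformer can model autoregressively. `q` is an illustrative
    uniform quantization step."""
    h, w = img.shape
    blocks = img.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    coeffs = dctn(blocks, axes=(-2, -1), norm='ortho')
    return np.round(coeffs / q).astype(np.int32)   # quantized DCT tokens

img = np.random.rand(64, 64) * 255
tokens = to_dct_blocks(img)
print(tokens.shape)    # (8, 8, 8, 8); natural images give mostly zero tokens
```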
arXiv Detail & Related papers (2021-03-05T17:56:03Z)
- Locally Masked Convolution for Autoregressive Models [107.4635841204146]
LMConv is a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image.
We learn an ensemble of distribution estimators that share parameters but differ in generation order, achieving improved performance on whole-image density estimation.
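A convolution with a different kernel mask per spatial location can be implemented with `unfold`; this sketch follows the general idea, not the paper's reference code:

```python
import torch
import torch.nn.functional as F

def locally_masked_conv2d(x, weight, masks):
    """Convolution where each spatial location applies its own binary mask
    to the kernel weights, enabling arbitrary generation orders.
    x: (B, C, H, W); weight: (O, C, k, k); masks: (H*W, C*k*k)."""
    B, C, H, W = x.shape
    O, _, k, _ = weight.shape
    patches = F.unfold(x, k, padding=k // 2)          # (B, C*k*k, H*W)
    patches = patches * masks.t().unsqueeze(0)        # mask each location
    out = weight.view(O, -1) @ patches                # (B, O, H*W)
    return out.view(B, O, H, W)

x = torch.randn(1, 3, 8, 8)
w = torch.randn(16, 3, 3, 3)
masks = (torch.rand(64, 27) > 0.5).float()            # one mask per pixel
print(locally_masked_conv2d(x, w, masks).shape)       # torch.Size([1, 16, 8, 8])
```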
arXiv Detail & Related papers (2020-06-22T17:59:07Z)