Multi-scale Efficient Graph-Transformer for Whole Slide Image
Classification
- URL: http://arxiv.org/abs/2305.15773v1
- Date: Thu, 25 May 2023 06:34:14 GMT
- Title: Multi-scale Efficient Graph-Transformer for Whole Slide Image
Classification
- Authors: Saisai Ding, Juncheng Li, Jun Wang, Shihui Ying, Jun Shi
- Abstract summary: We propose a novel Multi-scale Efficient Graph-Transformer (MEGT) framework for WSI classification.
The key idea of MEGT is to adopt two independent Efficient Graph-based Transformer (EGT) branches to process the low-resolution and high-resolution patch embeddings.
We also propose a novel multi-scale feature fusion module (MFFM) to alleviate the semantic gap between patches of different resolutions during feature fusion.
- Score: 16.19677745296922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-scale information in whole slide images (WSIs) is essential for cancer diagnosis. Although existing multi-scale vision Transformers have proven effective for learning multi-scale image representations, they still cannot work well on gigapixel WSIs due to the extremely large image sizes. To this end, we propose a novel Multi-scale Efficient Graph-Transformer (MEGT) framework for WSI classification. The key idea of MEGT is to adopt two independent Efficient Graph-based Transformer (EGT) branches to process the low-resolution and high-resolution patch embeddings (i.e., tokens in a Transformer) of WSIs, respectively, and then fuse these tokens via a multi-scale feature fusion module (MFFM). Specifically, we design the EGT to efficiently learn the local-global information of patch tokens; it integrates a graph representation into the Transformer to capture the spatial information of WSIs. Meanwhile, we propose a novel MFFM to alleviate the semantic gap between patches of different resolutions during feature fusion; it creates a non-patch token for each branch as an agent that exchanges information with the other branch via cross-attention. In addition, to expedite network training, a novel token pruning module is developed in the EGT to reduce redundant tokens. Extensive experiments on the TCGA-RCC and CAMELYON16 datasets demonstrate the effectiveness of the proposed MEGT.
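A minimal sketch may help make the two mechanisms described above concrete: the MFFM-style exchange, in which a non-patch "agent" token from one resolution branch cross-attends to the patch tokens of the other branch, and a simple top-k token pruning step. This is an illustration under assumed shapes and names (CrossBranchFusion, prune_tokens, 384-dimensional embeddings), not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    """The agent token of one branch cross-attends to the patch tokens of the other branch."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, agent, other_tokens):
        # agent: (B, 1, D) non-patch token; other_tokens: (B, N, D) patch tokens
        fused, _ = self.cross_attn(query=agent, key=other_tokens, value=other_tokens)
        return agent + fused  # residual update keeps the branch-specific signal

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the top-k patch tokens ranked by an importance score."""
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices                      # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, k, D)
    return tokens.gather(1, idx)

# Toy usage with random features standing in for the patch embeddings of one WSI.
low_tokens  = torch.randn(2, 196, 384)    # low-resolution branch tokens
high_tokens = torch.randn(2, 784, 384)    # high-resolution branch tokens
low_agent   = torch.zeros(2, 1, 384)      # in practice these would be learnable parameters
high_agent  = torch.zeros(2, 1, 384)

fusion = CrossBranchFusion(dim=384)
low_agent  = fusion(low_agent,  high_tokens)  # low-res agent gathers high-res context
high_agent = fusion(high_agent, low_tokens)   # high-res agent gathers low-res context

# Prune redundant high-resolution tokens, using similarity to the agent as a proxy score.
scores = (high_tokens @ high_agent.transpose(1, 2)).squeeze(-1)   # (B, N)
high_tokens = prune_tokens(high_tokens, scores, keep_ratio=0.5)
```

Because each agent is a single query, this cross-branch exchange costs time and memory linear in the number of patch tokens, which is what keeps the fusion affordable for gigapixel WSIs.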
Related papers
- SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers [0.0]
We introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that integrates multi-scale features with graph attention.
Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information.
The SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.
arXiv Detail & Related papers (2024-11-14T13:15:27Z)
- Diagnose Like a Pathologist: Transformer-Enabled Hierarchical Attention-Guided Multiple Instance Learning for Whole Slide Image Classification [39.41442041007595]
Multiple Instance Learning and transformers are increasingly popular in histopathology Whole Slide Image (WSI) classification.
We propose a Hierarchical Attention-Guided Multiple Instance Learning framework to fully exploit the WSIs.
Within this framework, an Integrated Attention Transformer is proposed to further enhance the performance of the transformer.
arXiv Detail & Related papers (2023-01-19T15:38:43Z)
- Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representation of giga-pixel level whole slide pathology images (WSI) for downstream tasks is critical.
This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z)
- Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, the capability of Transformers to incorporate contextual information and extract features dynamically is often neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images (see the sketch after this list).
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents an efficient multi-scale vision Transformer, called ResT, that can capably serve as a general-purpose backbone for image recognition.
We show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than the quadratic complexity of standard attention.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
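As referenced in the XCiT entry above, the following is a rough, single-head sketch of cross-covariance (channel-wise) attention. It omits the learnable temperature and multi-head details of the actual paper, and the function name is illustrative only.

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v):
    """Attention over feature channels instead of tokens (XCA-style sketch).

    q, k, v: (B, N, D). The attention map is D x D, so the cost grows linearly
    with the number of tokens N, unlike the N x N map of standard self-attention.
    """
    # L2-normalise along the token axis so channel-to-channel dot products are bounded.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    attn = (q.transpose(1, 2) @ k).softmax(dim=-1)   # (B, D, D) channel-to-channel weights
    out = attn @ v.transpose(1, 2)                   # (B, D, N): mix the value channels
    return out.transpose(1, 2)                       # (B, N, D)

tokens = torch.randn(2, 1024, 256)                   # many tokens, modest channel count
print(cross_covariance_attention(tokens, tokens, tokens).shape)  # torch.Size([2, 1024, 256])
```

The same linear-in-tokens property is what the CrossViT-style cross-attention above exploits, though by a different route: a single query token rather than a channel-wise attention map.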
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.