Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis
- URL: http://arxiv.org/abs/2410.14195v1
- Date: Fri, 18 Oct 2024 06:12:36 GMT
- Title: Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis
- Authors: Honglin Li, Yunlong Zhang, Pingyi Chen, Zhongyi Shui, Chenglu Zhu, Lin Yang
- Abstract summary: Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors.
Previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels.
To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs approximation of full self-attention.
- Score: 9.090504201460817
- Abstract: Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis models for WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their use of absolute positional embedding struggles to effectively handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of the attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and modelling of local-global information interactions. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at github.com/invoker-LL/Long-MIL.
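The core mechanism, a banded local attention evaluated chunk by chunk, can be sketched in a few lines. Below is a minimal illustration under assumed names and hyperparameters (bandwidth, chunk size); it is not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def chunked_local_attention(q, k, v, bandwidth=128, chunk=256):
    """Banded (local) self-attention computed chunk by chunk.

    Each query i attends only to keys j with |i - j| <= bandwidth, so each
    query chunk touches a bounded key window and the overall cost is
    O(n * bandwidth) rather than O(n^2). q, k, v: (n, d) single-head tensors.
    """
    n, d = q.shape
    out = torch.empty_like(q)
    scale = d ** -0.5
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        # Key window covering the band of queries [start, end).
        ks, ke = max(0, start - bandwidth), min(n, end + bandwidth)
        scores = (q[start:end] @ k[ks:ke].T) * scale          # (chunk, window)
        qi = torch.arange(start, end).unsqueeze(1)            # query indices
        kj = torch.arange(ks, ke).unsqueeze(0)                # key indices
        scores = scores.masked_fill((qi - kj).abs() > bandwidth, float("-inf"))
        out[start:end] = F.softmax(scores, dim=-1) @ v[ks:ke]
    return out

# Usage: 10,000 patch tokens with a 64-dim head fit comfortably in memory.
tokens = torch.randn(10_000, 64)
print(chunked_local_attention(tokens, tokens, tokens).shape)  # (10000, 64)
```

In LongMIL this local pass is combined with global interaction in a hybrid Transformer; only the local masking trick is sketched here.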
Related papers
- Agent Aggregator with Mask Denoise Mechanism for Histopathology Whole Slide Image Analysis [6.708196053187949]
Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and localization of regions of interest (ROIs) can assist pathologists in diagnosis.
In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification.
We propose AMD-MIL, an agent aggregator with a mask denoise mechanism.
arXiv Detail & Related papers (2024-09-18T03:02:19Z)
- MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models [56.37780601189795]
We propose a framework named MamMIL for WSI analysis.
We represent each WSI as an undirected graph.
To address the problem that Mamba can only process 1D sequences, we propose a topology-aware scanning mechanism.
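The scanning mechanism itself is not detailed in this summary; as a rough illustration of the general idea, graph nodes can be linearized with a breadth-first traversal so that patches adjacent in the tissue graph stay close in the 1D sequence fed to the state-space model (a hypothetical stand-in, not the paper's actual scan):

```python
from collections import deque

def bfs_scan_order(adjacency, start=0):
    """Linearize an undirected patch graph into a 1D 'scan' by BFS, keeping
    graph neighbours near each other in the sequence handed to a 1D model."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# Four patches arranged in a square: edges 0-1, 0-2, 1-3, 2-3.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_scan_order(adj))  # [0, 1, 2, 3]
```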
arXiv Detail & Related papers (2024-03-08T09:02:13Z)
- Long-MIL: Scaling Long Contextual Multiple Instance Learning for Histopathology Whole Slide Image Analysis [9.912061800841267]
Whole Slide Images (WSIs) of histopathology tissue are used for analysis.
Previous methods generally divide the WSI into a large number of patches, then aggregate all patches within a WSI to make the slide-level prediction.
We propose to amend position embedding for shape varying long-contextual WSI by introducing Linear Bias into Attention.
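The amendment is in the spirit of ALiBi extended to 2D layouts: attention logits receive a penalty that grows linearly with the spatial distance between patches, so no fixed-size absolute position table is needed. A minimal sketch (the Euclidean metric and slope value are illustrative assumptions):

```python
import torch

def spatial_linear_bias(coords, slope=0.5):
    """ALiBi-style additive bias for patches at 2D grid positions.

    coords: (n, 2) patch coordinates. Returns an (n, n) tensor added to the
    attention logits before softmax; far-apart patches are penalized
    linearly in their distance, which adapts to shape-varying WSIs.
    """
    return -slope * torch.cdist(coords.float(), coords.float())

coords = torch.tensor([[0, 0], [0, 1], [5, 5]])
print(spatial_linear_bias(coords))
```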
arXiv Detail & Related papers (2023-11-21T03:08:47Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
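The warping step itself can be illustrated with standard PyTorch affine sampling; the matrix below is a fixed example, whereas AAT learns such parameters (this is not the paper's module, just the underlying operation):

```python
import torch
import torch.nn.functional as F

# A 2x3 affine matrix: slight rotation plus a small translation.
theta = torch.tensor([[[0.98, -0.17, 0.05],
                       [0.17,  0.98, 0.00]]])   # (N, 2, 3)
image = torch.rand(1, 3, 64, 64)                # (N, C, H, W)

# Build a sampling grid from the affine matrix, then warp the image with it.
grid = F.affine_grid(theta, image.shape, align_corners=False)
warped = F.grid_sample(image, grid, align_corners=False)
print(warped.shape)  # torch.Size([1, 3, 64, 64])
```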
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
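A Laplacian pyramid, the decomposition being re-calibrated, separates each scale's band-pass detail from a low-frequency residual. A minimal sketch of the decomposition alone (not the paper's re-calibration module):

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels=3):
    """Split (N, C, H, W) maps into band-pass levels plus a low-pass residual.
    Each level is the current map minus an upsampled copy of its blurred,
    half-resolution version, i.e. the high-frequency detail at that scale."""
    pyramid = []
    for _ in range(levels):
        down = F.avg_pool2d(x, 2)                    # low-pass + downsample
        up = F.interpolate(down, size=x.shape[-2:],
                           mode="bilinear", align_corners=False)
        pyramid.append(x - up)                       # band-pass detail
        x = down
    pyramid.append(x)                                # low-frequency residual
    return pyramid

print([t.shape for t in laplacian_pyramid(torch.rand(1, 3, 64, 64))])
```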
arXiv Detail & Related papers (2023-08-31T19:56:14Z)
- TPMIL: Trainable Prototype Enhanced Multiple Instance Learning for Whole Slide Image Classification [13.195971707693365]
We develop a Trainable Prototype enhanced deep MIL framework for weakly supervised WSI classification.
Our method is able to reveal the correlations between different tumor subtypes through distances between corresponding trained prototypes.
We test our method on two WSI datasets and it achieves new state-of-the-art results.
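Reading subtype relations off the prototypes reduces to pairwise distances in feature space; a minimal sketch with hypothetical dimensions (smaller distance suggests more closely related subtypes):

```python
import torch

# Hypothetical trained prototypes: one 256-d vector per tumor subtype.
prototypes = torch.randn(4, 256)

# Entry (i, j) is small when subtypes i and j sit close together in the
# learned feature space, hinting at correlated subtypes.
print(torch.cdist(prototypes, prototypes))
```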
arXiv Detail & Related papers (2023-05-01T07:39:19Z)
- Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z)
- Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representations of giga-pixel whole slide pathology images (WSIs) for downstream tasks is critical.
This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z)
- Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction [0.22940141855172028]
We propose a general-purpose local attention graph-based Transformer for MIL (LA-MIL).
We demonstrate that LA-MIL achieves state-of-the-art results in mutation prediction for gastrointestinal cancer.
This suggests that local self-attention sufficiently models dependencies on par with global modules.
arXiv Detail & Related papers (2022-05-13T14:24:24Z)
- Learning A 3D-CNN and Transformer Prior for Hyperspectral Image Super-Resolution [80.93870349019332]
We propose a novel HSISR method that uses a Transformer instead of a CNN to learn the prior of HSIs.
Specifically, we first use the gradient algorithm to solve the HSISR model, and then use an unfolding network to simulate the iterative solution processes.
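Deep unfolding in general turns each solver iteration into a network stage, with a learnable module standing in for the prior term; a generic sketch under assumed shapes (not the paper's architecture):

```python
import torch
import torch.nn as nn

class UnfoldingNet(nn.Module):
    """Unrolls K gradient steps of min_x ||x A^T - y||^2 + prior(x), with a
    small learnable layer in place of the prior's proximal operator."""
    def __init__(self, dim, steps=5, lr=0.1):
        super().__init__()
        self.lr = lr
        # One denoiser per unrolled iteration (stand-in for the learned prior).
        self.priors = nn.ModuleList(nn.Linear(dim, dim) for _ in range(steps))

    def forward(self, y, A):
        x = y.clone()
        for prior in self.priors:
            grad = (x @ A.T - y) @ A     # gradient of the data-fidelity term
            x = x - self.lr * grad       # gradient-descent step
            x = prior(x)                 # learned prior / proximal step
        return x

A = torch.eye(16)                        # toy degradation operator
net = UnfoldingNet(dim=16)
print(net(torch.randn(2, 16), A).shape)  # torch.Size([2, 16])
```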
arXiv Detail & Related papers (2021-11-27T15:38:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.