PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization
- URL: http://arxiv.org/abs/2503.06482v1
- Date: Sun, 09 Mar 2025 06:51:08 GMT
- Title: PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization
- Authors: Honglin Li, Zhongyi Shui, Yunlong Zhang, Chenglu Zhu, Lin Yang
- Abstract summary: Computational pathology and whole-slide image (WSI) analysis are pivotal in cancer diagnosis and prognosis.
Recent advancements in pathology foundation models have improved performance, yet most approaches rely on the [CLS] token representation of tile ViTs as slide-level inputs.
This discards critical spatial details from patch tokens, limiting downstream WSI analysis tasks.
We introduce vector quantized (VQ) distillation on patch features, which efficiently compresses spatial patch tokens using discrete indices and a decoder.
- Score: 9.632442075645542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computational pathology and whole-slide image (WSI) analysis are pivotal in cancer diagnosis and prognosis. However, the ultra-high resolution of WSIs presents significant modeling challenges. Recent advancements in pathology foundation models have improved performance, yet most approaches rely on the [CLS] token representation of tile ViTs as slide-level inputs (16x16 pixels is referred to as a patch and 224x224 pixels as a tile). This discards critical spatial details from patch tokens, limiting downstream WSI analysis tasks. We find that leveraging all spatial patch tokens benefits WSI analysis but incurs nearly 200x higher storage and training costs (e.g., 196 tokens in ViT$_{224}$). To address this, we introduce vector quantized (VQ) distillation on patch features, which efficiently compresses spatial patch tokens using discrete indices and a decoder. Our method reduces token dimensionality from 1024 to 16, achieving a 64x compression rate while preserving reconstruction fidelity. Furthermore, we employ a multi-scale VQ (MSVQ) strategy, which not only enhances VQ reconstruction performance but also serves as a Self-supervised Learning (SSL) supervision for a seamless slide-level pretraining objective. Built upon the quantized patch features and supervision targets of tiles via MSVQ, we develop a progressive convolutional module and slide-level SSL to extract representations with rich spatial information for downstream WSI tasks. Extensive evaluations on multiple datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance in WSI analysis. Code will be available soon.
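The compression step the abstract describes — mapping 1024-dim patch tokens to 16-dim codes and storing each token as a discrete index — can be sketched as follows. This is a minimal illustration with a random (untrained) projection and codebook; the codebook size (512) and the projection itself are assumptions for the sketch, not the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes follow the abstract: 196 spatial patch tokens per 224x224 tile,
# 1024-dim features compressed to 16 dims (64x reduction). The codebook
# size (512) and random weights are illustrative assumptions.
n_tokens, feat_dim, code_dim, codebook_size = 196, 1024, 16, 512

proj = rng.normal(size=(feat_dim, code_dim)) / np.sqrt(feat_dim)
codebook = rng.normal(size=(codebook_size, code_dim))

patch_tokens = rng.normal(size=(n_tokens, feat_dim))  # tile ViT patch features
z = patch_tokens @ proj                               # down-project to 16 dims

# Nearest-codeword assignment: each token can then be stored as one integer
# index, and a decoder reconstructs features from codebook[indices].
d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = d.argmin(axis=1)
z_q = codebook[indices]                               # quantized 16-dim features

print(indices.shape, z_q.shape)  # (196,) (196, 16)
```

In the actual method the projection, codebook, and decoder are trained so that reconstructed features match the teacher's patch tokens; the sketch only shows the quantization data flow.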
Related papers
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs).
We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity.
Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z)
- DINO-Tok: Adapting DINO for Visual Tokenizers [52.194754463297706]
DINO-Tok is a visual tokenizer that unifies hierarchical representations into an information-complete latent space.
On ImageNet, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling.
arXiv Detail & Related papers (2025-11-25T18:00:00Z)
- Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning [8.284127681482202]
LVTP is a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features, with two rounds of clustering.
It integrates high-level semantics and basic visual attributes for precise segmentation.
As a plug-and-play module, it requires no architectural changes or additional training.
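The entropy-guided pruning idea behind LVTP can be sketched as a generic top-k token filter. The Tsallis-entropy score below is a simplified stand-in (single scale, no clustering, softmax over feature magnitudes); all sizes and the scoring choice are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def tsallis_entropy(p, q=1.5):
    """Tsallis entropy of probability rows: (1 - sum p_i^q) / (q - 1)."""
    return (1.0 - (p ** q).sum(axis=-1)) / (q - 1.0)

# 196 tokens with 64-dim features; the per-token score here is Tsallis
# entropy of a softmax over feature magnitudes, a crude stand-in for
# LVTP's multi-scale low-level cues.
tokens = rng.normal(size=(196, 64))
logits = np.abs(tokens)
p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
scores = tsallis_entropy(p)

keep = 98                                 # prune half the tokens
kept_idx = np.argsort(scores)[-keep:]     # keep the highest-scoring tokens
pruned = tokens[np.sort(kept_idx)]        # preserve original spatial order

print(pruned.shape)  # (98, 64)
```

Because the filter only selects a subset of tokens and keeps their order, it can sit in front of an existing ViT without architectural changes, which is what makes such pruning plug-and-play.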
arXiv Detail & Related papers (2025-04-25T00:43:20Z)
- PySpatial: A High-Speed Whole Slide Image Pathomics Toolkit [5.52658544303762]
We present PySpatial, a high-speed pathomics toolkit specifically for WSI-level analysis.
PySpatial streamlines the conventional pipeline by directly operating on computational regions of interest.
Our experiments on two datasets, Perivascular Epithelioid Cell (PEC) and data from the Kidney Precision Medicine Project (KPMP), demonstrate substantial performance improvements.
arXiv Detail & Related papers (2025-01-10T18:24:00Z)
- Semantics Prompting Data-Free Quantization for Low-Bit Vision Transformers [59.772673692679085]
We propose SPDFQ, a Semantics Prompting Data-Free Quantization method for ViTs.
First, SPDFQ incorporates Attention Priors Alignment (APA), which uses randomly generated attention priors to enhance the semantics of synthetic images.
Second, SPDFQ introduces Multi-Semantic Reinforcement (MSR), which utilizes localized patch optimization to prompt efficient parameterization.
arXiv Detail & Related papers (2024-12-21T09:30:45Z)
- XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation [54.2574228021317]
We present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks.
Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), and binary spherical quantization (BSQ).
On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID).
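Residual quantization, one of the schemes XQ-GAN integrates, quantizes a vector in stages: each level encodes the error left by the previous one, so a vector is stored as one small index per level. A minimal sketch, with random rather than learned codebooks and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest(codebook, x):
    """Row index of the nearest codeword for each row of x."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# Toy setup: 3-level residual quantization of 8-dim vectors with
# 64-entry codebooks. Real systems learn these codebooks jointly.
dim, n_codes, n_levels = 8, 64, 3
codebooks = [rng.normal(size=(n_codes, dim)) for _ in range(n_levels)]

x = rng.normal(size=(100, dim))
residual = x.copy()
recon = np.zeros_like(x)
codes = []
for cb in codebooks:          # each level quantizes what is left over
    idx = nearest(cb, residual)
    codes.append(idx)
    recon += cb[idx]
    residual -= cb[idx]

codes = np.stack(codes, axis=1)   # (100, 3): three integers per vector
print(codes.shape, recon.shape)
```

With learned codebooks, each extra level shrinks the reconstruction error, which is why residual and multi-scale variants tend to reconstruct better than a single codebook of the same size.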
arXiv Detail & Related papers (2024-12-02T17:58:06Z)
- PATHS: A Hierarchical Transformer for Efficient Whole Slide Image Analysis [9.862551438475666]
We propose a novel top-down method for hierarchical weakly supervised representation learning on slide-level tasks in computational pathology.
PATHS is inspired by the cross-magnification manner in which a human pathologist examines a slide, filtering patches at each magnification level to a small subset relevant to the diagnosis.
We apply PATHS to five datasets of The Cancer Genome Atlas (TCGA), and achieve superior performance on slide-level prediction tasks.
arXiv Detail & Related papers (2024-11-27T11:03:38Z)
- A self-supervised framework for learning whole slide representations [52.774822784847565]
We present Slide Pre-trained Transformers (SPT) for gigapixel-scale self-supervision of whole slide images.
We benchmark SPT visual representations on five diagnostic tasks across three biomedical microscopy datasets.
arXiv Detail & Related papers (2024-02-09T05:05:28Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- LESS: Label-efficient Multi-scale Learning for Cytological Whole Slide Image Screening [19.803614403803962]
We propose a weakly-supervised Label-Efficient WSI Screening method, dubbed LESS, for cytological WSI analysis with only slide-level labels.
We provide appropriate supervision by using slide-level labels to improve the learning of patch-level features.
It outperforms state-of-the-art MIL methods on pathology WSIs and realizes automatic cytological WSI cancer screening.
arXiv Detail & Related papers (2023-06-06T05:09:20Z)
- Task-specific Fine-tuning via Variational Information Bottleneck for Weakly-supervised Pathology Whole Slide Image Classification [10.243293283318415]
Multiple Instance Learning (MIL) has shown promising results in digital Pathology Whole Slide Image (WSI) classification.
We propose an efficient WSI fine-tuning framework motivated by the Information Bottleneck theory.
Our framework is evaluated on five pathology WSI datasets on various WSI heads.
arXiv Detail & Related papers (2023-03-15T08:41:57Z)
- Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representations of giga-pixel whole slide pathology images (WSIs) for downstream tasks is critical.
This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z)
- Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction [138.04956118993934]
We propose a novel Transformer-based method, the coarse-to-fine sparse Transformer (CST).
CST embeds HSI sparsity into deep learning for HSI reconstruction.
In particular, CST uses our proposed spectra-aware screening mechanism (SASM) for coarse patch selection. The selected patches are then fed into our customized spectra-aggregation hashing multi-head self-attention (SAH-MSA) for fine pixel clustering and self-similarity capturing.
arXiv Detail & Related papers (2022-03-09T16:17:47Z)
- An Efficient Cervical Whole Slide Image Analysis Framework Based on Multi-scale Semantic and Spatial Features using Deep Learning [2.7218168309244652]
This study designs a novel inline connection network (InCNet) by enriching the multi-scale connectivity to build the lightweight model named You Only Look Cytopathology Once (YOLCO).
The proposed model allows the input size to be enlarged to the megapixel level, stitching the WSI without any overlap by averaging repeated outputs.
With a Transformer classifying the integrated multi-scale multi-task features, the experimental results show a $0.872$ AUC score, better and $2.51\times$ faster than the best conventional method in WSI classification.
arXiv Detail & Related papers (2021-06-29T06:24:55Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.