Token Sparsification for Faster Medical Image Segmentation
- URL: http://arxiv.org/abs/2303.06522v1
- Date: Sat, 11 Mar 2023 23:59:13 GMT
- Title: Token Sparsification for Faster Medical Image Segmentation
- Authors: Lei Zhou, Huidong Liu, Joseph Bae, Junjun He, Dimitris Samaras,
Prateek Prasanna
- Abstract summary: We reformulate segmentation as a sparse encoding -> token completion -> dense decoding (SCD) pipeline.
STP predicts importance scores with a lightweight sub-network and samples the topK tokens.
MTA restores a full token sequence by assembling both sparse output tokens and pruned multi-layer intermediate ones.
- Score: 37.25161294917211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can we use sparse tokens for dense prediction, e.g., segmentation? Although
token sparsification has been applied to Vision Transformers (ViT) to
accelerate classification, it is still unknown how to perform segmentation from
sparse tokens. To this end, we reformulate segmentation as a sparse encoding ->
token completion -> dense decoding (SCD) pipeline. We first empirically show
that naively applying existing approaches from classification token pruning and
masked image modeling (MIM) leads to failure and inefficient training caused by
inappropriate sampling algorithms and the low quality of the restored dense
features. In this paper, we propose Soft-topK Token Pruning (STP) and
Multi-layer Token Assembly (MTA) to address these problems. In sparse encoding,
STP predicts token importance scores with a lightweight sub-network and samples
the topK tokens. The intractable topK gradients are approximated through a
continuous perturbed score distribution. In token completion, MTA restores a
full token sequence by assembling both sparse output tokens and pruned
multi-layer intermediate ones. The last dense decoding stage is compatible with
existing segmentation decoders, e.g., UNETR. Experiments show that SCD pipelines
equipped with STP and MTA are much faster than baselines without token pruning,
with up to 120% higher training throughput and up to 60.6% higher inference
throughput, while maintaining segmentation quality.
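The perturbed top-K idea in STP can be illustrated with a minimal NumPy sketch: a hard top-K selection becomes differentiable when averaged over noise-perturbed copies of the scores. The function name, the Gaussian noise choice, and the hyperparameters below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def soft_topk_membership(scores, k, sigma=0.1, n_samples=200, seed=0):
    """Monte-Carlo relaxation of hard top-K selection: perturb the token
    importance scores with Gaussian noise and average the resulting hard
    top-K indicator vectors. The average is a smooth function of the
    scores, so gradients can be approximated through it during training."""
    rng = np.random.default_rng(seed)
    membership = np.zeros_like(scores, dtype=float)
    for _ in range(n_samples):
        noisy = scores + sigma * rng.standard_normal(scores.shape)
        topk_idx = np.argpartition(noisy, -k)[-k:]  # hard top-K on the noisy copy
        membership[topk_idx] += 1.0
    return membership / n_samples  # soft membership probabilities; sums to k

# Tokens with clearly separated importance scores: the soft membership
# concentrates on the two most important tokens.
scores = np.array([9.0, 8.5, 0.2, 0.1, 0.0])
probs = soft_topk_membership(scores, k=2)
```

In the full pipeline, the sampled tokens would be fed through the sparse encoder while the pruned intermediate tokens are cached for MTA-style token completion; a straight-through estimator over such soft memberships is one common way to keep the sampling step differentiable.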
Related papers
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Tokens on Demand: Token Condensation as Training-free Test-time Adaptation [43.09801987385207]
Token Condensation as Adaptation (TCA) is a training-free approach designed to mitigate distribution shifts encountered by vision-language models (VLMs) during test-time inference.
As the first method to explore token efficiency in test-time adaptation, TCA consistently demonstrates superior performance across cross-dataset and out-of-distribution adaptation tasks.
arXiv Detail & Related papers (2024-10-16T07:13:35Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
Cooperated with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models [40.651650382105636]
The vanilla method adds padding tokens to keep the number of new tokens consistent across samples.
We propose a method that handles inconsistent numbers of accepted tokens across samples without padding tokens and without any increase in memory or computing overhead.
arXiv Detail & Related papers (2024-05-13T08:24:21Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation [18.168932826183024]
This work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation.
Experiments suggest that the proposed DToP architecture reduces the computational cost of current semantic segmentation methods by 20%-35% on average.
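The token early-exit mechanism can be sketched in a few lines; the stage structure, auxiliary confidence heads, and threshold below are illustrative assumptions rather than DToP's exact design:

```python
import numpy as np

def early_exit_forward(tokens, stages, conf_heads, threshold=0.95):
    """Run tokens through a sequence of stages; after each stage, tokens
    whose auxiliary-head confidence exceeds `threshold` are frozen and
    skip all remaining stages, saving computation on easy tokens."""
    out = tokens.copy()
    active = np.ones(len(out), dtype=bool)
    for stage, head in zip(stages, conf_heads):
        if not active.any():
            break                          # every token has already exited
        out[active] = stage(out[active])   # process only still-active tokens
        conf = head(out[active])           # per-token confidence in [0, 1]
        idx = np.flatnonzero(active)
        active[idx[conf >= threshold]] = False  # confident tokens exit early
    return out

# Toy example: each "stage" adds 1, and confidence grows with the value,
# so all tokens exit after the second stage and skip the third.
stages = [lambda x: x + 1.0] * 3
conf_heads = [lambda x: np.clip(x[:, 0] / 2.0, 0.0, 1.0)] * 3
out = early_exit_forward(np.zeros((4, 1)), stages, conf_heads)
```

The saving comes from the shrinking `active` mask: later stages only process the tokens that remain undecided.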
arXiv Detail & Related papers (2023-08-02T09:40:02Z)
- RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models [3.4523793651427113]
We propose the duplex masked auto-encoder, a.k.a. DupMAE, which aims to improve the semantic representation capacity of contextualized embeddings for both [] and ordinary tokens.
DupMAE is simple but empirically competitive: with a small decoding cost, it substantially contributes to the model's representation capability and transferability.
arXiv Detail & Related papers (2022-11-16T08:57:55Z)
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
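The token-clustering idea behind dropping redundant tokens can be sketched with plain k-means plus medoid selection. This is a simplification (CenterCLIP's actual algorithm clusters within temporal segments, and the first-k initialization here is an assumption made for determinism):

```python
import numpy as np

def condense_tokens(tokens, k, n_iter=10):
    """Cluster token embeddings with k-means (first-k initialization),
    then keep only the real token nearest to each centroid and drop the
    rest as redundant near-duplicates."""
    centroids = tokens[:k].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(tokens[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)          # nearest centroid per token
        for c in range(k):
            members = tokens[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    dists = np.linalg.norm(tokens[:, None] - centroids[None], axis=-1)
    keep = np.unique(dists.argmin(axis=0))     # nearest real token per centroid
    return tokens[keep]

# Two tight groups of near-duplicate tokens collapse to one token each:
tokens = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
kept = condense_tokens(tokens, k=2)
```

Keeping a medoid rather than the centroid itself means the retained sequence consists of genuine input tokens, which keeps it compatible with the rest of the encoder.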
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.