PAUMER: Patch Pausing Transformer for Semantic Segmentation
- URL: http://arxiv.org/abs/2311.00586v1
- Date: Wed, 1 Nov 2023 15:32:11 GMT
- Title: PAUMER: Patch Pausing Transformer for Semantic Segmentation
- Authors: Evann Courdier, Prabhu Teja Sivaprasad, François Fleuret
- Abstract summary: We study the problem of improving the efficiency of segmentation transformers by using disparate amounts of computation for different parts of the image.
Our method, PAUMER, accomplishes this by pausing computation for patches that are deemed to not need any more computation before the final decoder.
- Score: 3.3148826359547523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of improving the efficiency of segmentation transformers
by using disparate amounts of computation for different parts of the image. Our
method, PAUMER, accomplishes this by pausing computation for patches that are
deemed to not need any more computation before the final decoder. We use the
entropy of predictions computed from intermediate activations as the pausing
criterion, and find this aligns well with semantics of the image. Our method
has a unique advantage that a single network trained with the proposed strategy
can be effortlessly adapted at inference to various run-time requirements by
modulating its pausing parameters. On two standard segmentation datasets,
Cityscapes and ADE20K, we show that our method operates with about a $50\%$
higher throughput with an mIoU drop of about $0.65\%$ and $4.6\%$ respectively.
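The entropy-based pausing criterion described in the abstract can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the auxiliary head that would produce the intermediate logits is omitted, and the threshold value and toy shapes are assumptions.

```python
import numpy as np

def pause_mask(intermediate_logits: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean mask over patches: True = keep computing.

    `intermediate_logits` has shape (batch, num_patches, num_classes) and
    would come from a head applied to intermediate activations.  Patches
    whose prediction entropy falls below `threshold` are "paused" and can
    skip the remaining encoder blocks until the final decoder.
    """
    # Numerically stable softmax over the class dimension.
    z = intermediate_logits - intermediate_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)
    return entropy >= threshold

# Toy example: 1 image, 4 patches, 3 classes.
logits = np.array([[[8.0, 0.0, 0.0],    # confident -> low entropy -> pause
                    [0.1, 0.0, 0.2],    # uncertain -> keep computing
                    [5.0, 0.1, 0.0],    # confident -> pause
                    [0.0, 0.0, 0.0]]])  # uniform   -> keep computing
keep = pause_mask(logits, threshold=0.5)  # [[False, True, False, True]]
```

Raising the threshold pauses more patches (higher throughput, lower mIoU), which is how a single trained network can be tuned to different run-time budgets at inference.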
Related papers
- MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping [1.1557852082644071]
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of examples.
We propose a new Few-shot Semantic Segmentation framework based on the transformer architecture.
Our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies.
arXiv Detail & Related papers (2024-09-17T16:14:03Z) - PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference [44.77064952091458]
PRANCE is a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs.
We introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a sequential decision process.
Our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and pruning-merging strategies.
arXiv Detail & Related papers (2024-07-06T09:04:27Z) - Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z) - Early Fusion of Features for Semantic Segmentation [10.362589129094975]
This paper introduces a novel segmentation framework that integrates a classifier network with a reverse HRNet architecture for efficient image segmentation.
Our methodology is rigorously tested across several benchmark datasets including Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012.
The results demonstrate the effectiveness of our proposed model in achieving high segmentation accuracy, indicating its potential for various applications in image analysis.
arXiv Detail & Related papers (2024-02-08T22:58:06Z) - Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves $6-22\%$ higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z) - Inverse Quantum Fourier Transform Inspired Algorithm for Unsupervised Image Segmentation [2.4271601178529063]
A novel IQFT-inspired algorithm is proposed and implemented by leveraging the underlying mathematical structure of the IQFT.
The proposed method takes advantage of the phase information of the pixels in the image by encoding the pixels' intensity into qubit relative phases and applying IQFT to classify the pixels into different segments automatically and efficiently.
The proposed method outperforms the compared baselines on the PASCAL VOC 2012 segmentation benchmark and the xVIEW2 challenge dataset by as much as 50% in terms of mean Intersection-Over-Union (mIoU).
arXiv Detail & Related papers (2023-01-11T20:28:44Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
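The reuse idea can be illustrated with a toy single-head encoder. The linear map standing in for SkipAt's parametric reuse function, the layer indices chosen for skipping, and all shapes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(x, Wq, Wk, Wv):
    """Plain single-head self-attention over a (tokens, dim) matrix."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = q @ k.T / np.sqrt(k.shape[-1])
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v

d = 8
x = rng.standard_normal((16, d))                  # 16 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W_cheap = rng.standard_normal((d, d)) * 0.1       # stand-in reuse function

attn_cache = None
out = x
for layer in range(4):
    if layer in (1, 2):                 # "skip" layers: reuse cached attention
        attn = attn_cache @ W_cheap     # cheap O(N*d^2), no O(N^2) attention
    else:                               # normal layers: full self-attention
        attn = attention(out, Wq, Wk, Wv)
        attn_cache = attn
    out = out + attn                    # residual connection
```

The cached attention output is reused in later layers through a cheap transform, trading a small approximation for skipping the quadratic-cost attention there.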
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
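A content-based token-clustering attention can be sketched as below. Plain k-means on the keys stands in for the paper's actual clustering procedure, and the cluster count and shapes are illustrative assumptions:

```python
import numpy as np

def clustered_attention(q, k, v, num_clusters=4, iters=5):
    """Attend from every query to centroids of clustered key/value tokens.

    Reduces attention cost from O(N^2 * d) to O(N * num_clusters * d).
    A few rounds of plain k-means on the keys stand in for the
    content-based clustering step.
    """
    n, d = k.shape
    centroids = k[np.linspace(0, n - 1, num_clusters, dtype=int)]  # copies
    for _ in range(iters):
        dists = ((k[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(num_clusters):
            if np.any(assign == c):
                centroids[c] = k[assign == c].mean(axis=0)
    # Aggregate values per cluster to match the reduced key set.
    v_c = np.stack([v[assign == c].mean(axis=0) if np.any(assign == c)
                    else np.zeros(v.shape[1]) for c in range(num_clusters)])
    scores = q @ centroids.T / np.sqrt(d)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)
    return attn @ v_c

rng = np.random.default_rng(0)
q = rng.standard_normal((10, 8))    # 10 query tokens
k = rng.standard_normal((32, 8))    # 32 key tokens, reduced to 4 clusters
v = rng.standard_normal((32, 8))
out = clustered_attention(q, k, v)  # shape (10, 8)
```

Each query now attends to 4 centroids instead of 32 tokens, which is where the computational saving comes from.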
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Dense Gaussian Processes for Few-Shot Segmentation [66.08463078545306]
We propose a few-shot segmentation method based on dense Gaussian process (GP) regression.
We exploit the end-to-end learning capabilities of our approach to learn a high-dimensional output space for the GP.
Our approach sets a new state-of-the-art for both 1-shot and 5-shot FSS on the PASCAL-5$i$ and COCO-20$i$ benchmarks.
arXiv Detail & Related papers (2021-10-07T17:57:54Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
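Encoding an image as a sequence of patches, as the summary above describes, amounts to the following reshape; the patch size and image shape are illustrative, and the learned linear projection that would map each flattened patch to an embedding is omitted:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return grid.reshape(-1, patch * patch * c)    # (num_patches, p*p*C)

# 64x64 RGB image, 16x16 patches -> 4*4 = 16 tokens of dim 16*16*3 = 768.
tokens = patchify(np.zeros((64, 64, 3)))
```

The resulting token sequence is what a pure transformer encoder consumes in place of a convolutional feature map.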
arXiv Detail & Related papers (2020-12-31T18:55:57Z) - Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.