PAUMER: Patch Pausing Transformer for Semantic Segmentation
- URL: http://arxiv.org/abs/2311.00586v1
- Date: Wed, 1 Nov 2023 15:32:11 GMT
- Title: PAUMER: Patch Pausing Transformer for Semantic Segmentation
- Authors: Evann Courdier, Prabhu Teja Sivaprasad, François Fleuret
- Abstract summary: We study the problem of improving the efficiency of segmentation transformers by using disparate amounts of computation for different parts of the image.
Our method, PAUMER, accomplishes this by pausing computation for patches that are deemed to not need any more computation before the final decoder.
- Score: 3.3148826359547523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of improving the efficiency of segmentation transformers
by using disparate amounts of computation for different parts of the image. Our
method, PAUMER, accomplishes this by pausing computation for patches that are
deemed to not need any more computation before the final decoder. We use the
entropy of predictions computed from intermediate activations as the pausing
criterion, and find this aligns well with semantics of the image. Our method
has a unique advantage that a single network trained with the proposed strategy
can be effortlessly adapted at inference to various run-time requirements by
modulating its pausing parameters. On two standard segmentation datasets,
Cityscapes and ADE20K, we show that our method operates with about a $50\%$
higher throughput with an mIoU drop of about $0.65\%$ and $4.6\%$ respectively.
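The entropy-based pausing criterion described in the abstract can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the auxiliary head that would produce the intermediate logits is omitted, and the threshold value and toy shapes are assumptions.

```python
import numpy as np

def pause_mask(intermediate_logits: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean mask over patches: True = keep computing.

    `intermediate_logits` has shape (batch, num_patches, num_classes) and
    would come from a head applied to intermediate activations.  Patches
    whose prediction entropy falls below `threshold` are "paused" and can
    skip the remaining encoder blocks until the final decoder.
    """
    # Numerically stable softmax over the class dimension.
    z = intermediate_logits - intermediate_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)
    return entropy >= threshold

# Toy example: 1 image, 4 patches, 3 classes.
logits = np.array([[[8.0, 0.0, 0.0],    # confident -> low entropy -> pause
                    [0.1, 0.0, 0.2],    # uncertain -> keep computing
                    [5.0, 0.1, 0.0],    # confident -> pause
                    [0.0, 0.0, 0.0]]])  # uniform   -> keep computing
keep = pause_mask(logits, threshold=0.5)  # [[False, True, False, True]]
```

Raising the threshold pauses more patches (higher throughput, lower mIoU), which is how a single trained network can be tuned to different run-time budgets at inference.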
Related papers
- MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping [1.1557852082644071]
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of examples.
We propose a new Few-shot Semantic Segmentation framework based on the transformer architecture.
Our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies.
arXiv Detail & Related papers (2024-09-17T16:14:03Z) - PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference [44.77064952091458]
PRANCE is a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs.
We introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a sequential decision process.
Our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and pruning-merging strategies.
arXiv Detail & Related papers (2024-07-06T09:04:27Z) - Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z) - Early Fusion of Features for Semantic Segmentation [10.362589129094975]
This paper introduces a novel segmentation framework that integrates a classifier network with a reverse HRNet architecture for efficient image segmentation.
Our methodology is rigorously tested across several benchmark datasets including Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012.
The results demonstrate the effectiveness of our proposed model in achieving high segmentation accuracy, indicating its potential for various applications in image analysis.
arXiv Detail & Related papers (2024-02-08T22:58:06Z) - Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves $6-22\%$ higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z) - Inverse Quantum Fourier Transform Inspired Algorithm for Unsupervised Image Segmentation [2.4271601178529063]
A novel IQFT-inspired algorithm is proposed and implemented by leveraging the underlying mathematical structure of the IQFT.
The proposed method takes advantage of the phase information of the pixels in the image by encoding the pixels' intensity into qubit relative phases and applying IQFT to classify the pixels into different segments automatically and efficiently.
The proposed method outperforms the compared baselines on the PASCAL VOC 2012 segmentation benchmark and the xVIEW2 challenge dataset by as much as 50% in terms of mean Intersection-Over-Union (mIoU).
arXiv Detail & Related papers (2023-01-11T20:28:44Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
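The reuse idea can be illustrated with a toy single-head encoder. The linear map standing in for SkipAt's parametric reuse function, the layer indices chosen for skipping, and all shapes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(x, Wq, Wk, Wv):
    """Plain single-head self-attention over a (tokens, dim) matrix."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = q @ k.T / np.sqrt(k.shape[-1])
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v

d = 8
x = rng.standard_normal((16, d))                  # 16 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W_cheap = rng.standard_normal((d, d)) * 0.1       # stand-in reuse function

attn_cache = None
out = x
for layer in range(4):
    if layer in (1, 2):                 # "skip" layers: reuse cached attention
        attn = attn_cache @ W_cheap     # cheap O(N*d^2), no O(N^2) attention
    else:                               # normal layers: full self-attention
        attn = attention(out, Wq, Wk, Wv)
        attn_cache = attn
    out = out + attn                    # residual connection
```

The cached attention output is reused in later layers through a cheap transform, trading a small approximation for skipping the quadratic-cost attention there.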
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
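A content-based token-clustering attention can be sketched as below. Plain k-means on the keys stands in for the paper's actual clustering procedure, and the cluster count and shapes are illustrative assumptions:

```python
import numpy as np

def clustered_attention(q, k, v, num_clusters=4, iters=5):
    """Attend from every query to centroids of clustered key/value tokens.

    Reduces attention cost from O(N^2 * d) to O(N * num_clusters * d).
    A few rounds of plain k-means on the keys stand in for the
    content-based clustering step.
    """
    n, d = k.shape
    centroids = k[np.linspace(0, n - 1, num_clusters, dtype=int)]  # copies
    for _ in range(iters):
        dists = ((k[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(num_clusters):
            if np.any(assign == c):
                centroids[c] = k[assign == c].mean(axis=0)
    # Aggregate values per cluster to match the reduced key set.
    v_c = np.stack([v[assign == c].mean(axis=0) if np.any(assign == c)
                    else np.zeros(v.shape[1]) for c in range(num_clusters)])
    scores = q @ centroids.T / np.sqrt(d)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)
    return attn @ v_c

rng = np.random.default_rng(0)
q = rng.standard_normal((10, 8))    # 10 query tokens
k = rng.standard_normal((32, 8))    # 32 key tokens, reduced to 4 clusters
v = rng.standard_normal((32, 8))
out = clustered_attention(q, k, v)  # shape (10, 8)
```

Each query now attends to 4 centroids instead of 32 tokens, which is where the computational saving comes from.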
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Dense Gaussian Processes for Few-Shot Segmentation [66.08463078545306]
We propose a few-shot segmentation method based on dense Gaussian process (GP) regression.
We exploit the end-to-end learning capabilities of our approach to learn a high-dimensional output space for the GP.
Our approach sets a new state-of-the-art for both 1-shot and 5-shot FSS on the PASCAL-5$i$ and COCO-20$i$ benchmarks.
arXiv Detail & Related papers (2021-10-07T17:57:54Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
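Encoding an image as a sequence of patches, as the summary above describes, amounts to the following reshape; the patch size and image shape are illustrative, and the learned linear projection that would map each flattened patch to an embedding is omitted:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return grid.reshape(-1, patch * patch * c)    # (num_patches, p*p*C)

# 64x64 RGB image, 16x16 patches -> 4*4 = 16 tokens of dim 16*16*3 = 768.
tokens = patchify(np.zeros((64, 64, 3)))
```

The resulting token sequence is what a pure transformer encoder consumes in place of a convolutional feature map.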
arXiv Detail & Related papers (2020-12-31T18:55:57Z) - Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.