Content-aware Token Sharing for Efficient Semantic Segmentation with
Vision Transformers
- URL: http://arxiv.org/abs/2306.02095v1
- Date: Sat, 3 Jun 2023 12:05:07 GMT
- Title: Content-aware Token Sharing for Efficient Semantic Segmentation with
Vision Transformers
- Authors: Chenyang Lu, Daan de Geus, Gijs Dubbelman
- Abstract summary: This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks.
We employ a class-agnostic policy network that predicts if image patches contain the same semantic class, and lets them share a token if they do.
With Content-aware Token Sharing, we are able to reduce the number of processed tokens by up to 44%, without diminishing the segmentation quality.
- Score: 5.910159499687659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Content-aware Token Sharing (CTS), a token reduction
approach that improves the computational efficiency of semantic segmentation
networks that use Vision Transformers (ViTs). Existing works have proposed
token reduction approaches to improve the efficiency of ViT-based image
classification networks, but these methods are not directly applicable to
semantic segmentation, which we address in this work. We observe that, for
semantic segmentation, multiple image patches can share a token if they contain
the same semantic class, as they contain redundant information. Our approach
leverages this by employing an efficient, class-agnostic policy network that
predicts if image patches contain the same semantic class, and lets them share
a token if they do. With experiments, we explore the critical design choices of
CTS and show its effectiveness on the ADE20K, Pascal Context and Cityscapes
datasets, various ViT backbones, and different segmentation decoders. With
Content-aware Token Sharing, we are able to reduce the number of processed
tokens by up to 44%, without diminishing the segmentation quality.
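To make the mechanism concrete, here is a minimal PyTorch sketch of policy-driven token sharing. It is an illustration under assumptions rather than the authors' implementation: the module names (TokenSharingPolicy, share_tokens), the 2x2 superpatch grouping, the mean-pooling of shared tokens, and the sharing ratio are all chosen here for demonstration.

```python
# Hypothetical sketch of CTS-style token sharing (names, shapes and the mean-pooling
# of shared tokens are assumptions for illustration, not the authors' code).
import torch
import torch.nn as nn

class TokenSharingPolicy(nn.Module):
    """Scores each 2x2 superpatch for how likely its patches are to contain one class."""
    def __init__(self, embed_dim=192):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(4 * embed_dim, 256), nn.GELU(), nn.Linear(256, 1)
        )

    def forward(self, patch_tokens, grid_h, grid_w):
        # patch_tokens: (B, grid_h * grid_w, C) -> group into 2x2 superpatches.
        B, _, C = patch_tokens.shape
        x = patch_tokens.view(B, grid_h // 2, 2, grid_w // 2, 2, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 4 * C)  # (B, S, 4C)
        return self.scorer(x).squeeze(-1)                       # one score per superpatch

def share_tokens(patch_tokens, scores, grid_h, grid_w, share_ratio=0.5):
    """Let the top-scoring superpatches share one token (the mean of their 4 patch tokens)."""
    B, _, C = patch_tokens.shape
    sp = patch_tokens.view(B, grid_h // 2, 2, grid_w // 2, 2, C)
    sp = sp.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 4, C)      # (B, S, 4, C)
    num_shared = int(share_ratio * sp.shape[1])
    top = scores.topk(num_shared, dim=1).indices                # superpatches that share a token

    reduced = []
    for b in range(B):
        mask = torch.zeros(sp.shape[1], dtype=torch.bool)
        mask[top[b]] = True
        shared = sp[b, mask].mean(dim=1)                        # one token per shared superpatch
        kept = sp[b, ~mask].reshape(-1, C)                      # four tokens per unshared superpatch
        # Bookkeeping to scatter shared tokens back to their four patch positions
        # after the ViT encoder is omitted for brevity.
        reduced.append(torch.cat([shared, kept], dim=0))
    return reduced

# Toy usage with an assumed ViT-Tiny-like embedding size and a 32x32 patch grid.
tokens = torch.randn(2, 32 * 32, 192)
policy = TokenSharingPolicy(embed_dim=192)
scores = policy(tokens, grid_h=32, grid_w=32)
reduced = share_tokens(tokens, scores, grid_h=32, grid_w=32)
print(reduced[0].shape)  # fewer than the original 1024 tokens per image
```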
Related papers
- Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation [8.659766913542938]
We study unified perceptual and semantic token compression for understanding at all levels of granularity.
We propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution features with learnable codebooks.
Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid.
arXiv Detail & Related papers (2024-12-18T18:43:21Z)
- TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) they represent image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) they concentrate on regions with valuable details and represent them using fine tokens.
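As a rough illustration of this clustering idea, the toy snippet below runs plain k-means over patch features so that similar, possibly non-adjacent patches end up represented by a shared cluster-center token. The function name, hyperparameters, and the use of vanilla k-means are assumptions for illustration, not TCFormer's actual token-clustering procedure.

```python
# Toy clustering-based dynamic tokens (assumption-laden sketch, not TCFormer's algorithm).
import torch

def kmeans_tokens(tokens, num_clusters=64, iters=10):
    # tokens: (N, C) patch features for one image.
    centers = tokens[torch.randperm(tokens.shape[0])[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)   # nearest center per patch
        for k in range(num_clusters):
            members = tokens[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)              # update center to cluster mean
    return centers, assign

tokens = torch.randn(1024, 192)        # e.g. a 32x32 grid of ViT patch features
centers, assign = kmeans_tokens(tokens)
shared = centers[assign]               # every patch is now represented by its cluster token
print(centers.shape, shared.shape)     # 64 dynamic tokens stand in for 1024 patches
```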
arXiv Detail & Related papers (2024-07-16T02:26:18Z)
- Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z)
- Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world.
Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation.
We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
- Multi-Scale Semantic Segmentation with Modified MBConv Blocks [29.026787888644474]
We introduce a novel adaptation of MBConv blocks specifically tailored for semantic segmentation.
With these modifications, our approach achieves mean Intersection over Union (IoU) scores of 84.5% and 84.0% on the Cityscapes test and validation sets, respectively.
arXiv Detail & Related papers (2024-02-07T07:01:08Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method achieves competitive results compared to the original networks on object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it applicable to downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
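A simplified sketch of this label-tracing idea is given below, assuming a single self-attention step and one-hot per-token labels from a CutMix-style token mix; the actual TL-Align procedure (e.g., how residual connections and multiple layers are handled) is not reproduced here.

```python
# Hedged sketch: labels are remixed with the same attention weights that mix the tokens,
# so every transformed token keeps an aligned label distribution.
import torch
import torch.nn.functional as F

def attention_with_label_tracing(q, k, v, token_labels):
    # q, k, v: (N, C) token features; token_labels: (N, num_classes) per-token labels.
    attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)   # (N, N) attention weights
    out_tokens = attn @ v                                      # transformed tokens
    out_labels = attn @ token_labels                           # labels traced alongside the tokens
    return out_tokens, out_labels

N, C, num_classes = 8, 16, 10
x = torch.randn(N, C)
labels = F.one_hot(torch.randint(num_classes, (N,)), num_classes).float()  # e.g. a token-level mix
tokens, traced = attention_with_label_tracing(x, x, x, labels)
print(traced.sum(dim=-1))  # each traced label distribution still sums to 1
```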
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
- Self-supervision through Random Segments with Autoregressive Coding (RandSAC) [46.519302668058025]
We explore the effects various design choices have on the success of applying such training strategies for visual feature learning.
Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC).
In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across-segment predictions are sequential, similar to GPT.
We illustrate that randomized serialization of the segments significantly improves performance and results in a distribution over both spatially-long (across-segment) and spatially-short (within-segment) predictions.
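The segment-level ordering can be pictured with a toy attention mask, sketched below under simplifying assumptions (equal-sized, non-hierarchical segments and an arbitrary function name); it is not the paper's actual masking or decoder design.

```python
# Toy block-causal mask in the spirit of RandSAC: segment order is randomized,
# tokens within a segment are predicted in parallel, and each token may only
# attend to tokens in strictly earlier segments.
import torch

def random_segment_mask(num_tokens=16, num_segments=4, generator=None):
    seg_size = num_tokens // num_segments
    seg_id = torch.arange(num_tokens) // seg_size               # segment of each token
    order = torch.randperm(num_segments, generator=generator)   # randomized serialization
    rank = torch.empty(num_segments, dtype=torch.long)
    rank[order] = torch.arange(num_segments)                    # rank[s] = position of segment s
    token_rank = rank[seg_id]
    # attend[i, j] is True when token i may attend to token j, i.e. j's segment
    # comes strictly earlier in the random order (the first segment sees no context;
    # in practice a start token or similar would handle this).
    attend = token_rank.unsqueeze(1) > token_rank.unsqueeze(0)
    return attend

mask = random_segment_mask(generator=torch.Generator().manual_seed(0))
print(mask.int())
```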
arXiv Detail & Related papers (2022-03-22T21:28:55Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation [88.49669148290306]
We propose a novel weakly supervised multi-task framework called AuxSegNet to leverage saliency detection and multi-label image classification as auxiliary tasks.
Inspired by their similar structured semantics, we also propose to learn a cross-task global pixel-level affinity map from the saliency and segmentation representations.
The learned cross-task affinity can be used to refine saliency predictions and propagate CAM maps to provide improved pseudo labels for both tasks.
arXiv Detail & Related papers (2021-07-25T11:39:58Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.