Head-Free Lightweight Semantic Segmentation with Linear Transformer
- URL: http://arxiv.org/abs/2301.04648v1
- Date: Wed, 11 Jan 2023 18:59:46 GMT
- Title: Head-Free Lightweight Semantic Segmentation with Linear Transformer
- Authors: Bo Dong and Pichao Wang and Fan Wang
- Abstract summary: We propose a head-free lightweight architecture specifically for semantic segmentation, named Adaptive Frequency Transformer.
It adopts a parallel architecture that leverages prototype representations as specific learnable local descriptions, which replace the decoder.
Although removing the decoder eliminates most of the computational cost, the accuracy of the parallel structure is still limited by low computational resources.
- Score: 21.38163906180886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing semantic segmentation works have mainly focused on designing
effective decoders; however, the computational load introduced by the overall
structure has long been ignored, which hinders their application on
resource-constrained hardware. In this paper, we propose a head-free
lightweight architecture specifically for semantic segmentation, named Adaptive
Frequency Transformer. It adopts a parallel architecture that leverages
prototype representations as specific learnable local descriptions, which
replace the decoder and preserve the rich image semantics on high-resolution
features. Although removing the decoder eliminates most of the computational
cost, the accuracy of the parallel structure is still limited by low
computational resources. Therefore, we employ heterogeneous operators (CNN and
Vision Transformer) for pixel embedding and prototype representations to
further reduce computational cost. Moreover, it is very difficult to linearize
the complexity of the Vision Transformer from the spatial-domain perspective.
Because semantic segmentation is very sensitive to frequency information, we
construct a lightweight prototype learning block with an adaptive frequency
filter of complexity $O(n)$ to replace standard self-attention of complexity
$O(n^{2})$. Extensive experiments on widely adopted datasets demonstrate that
our model achieves superior accuracy while retaining only 3M parameters. On the
ADE20K dataset, our model achieves 41.8 mIoU at 4.6 GFLOPs, which is 4.4 mIoU
higher than SegFormer with 45% fewer GFLOPs. On the Cityscapes dataset, our
model achieves 78.7 mIoU at 34.4 GFLOPs, which is 2.5 mIoU higher than
SegFormer with 72.5% fewer GFLOPs. Code is available at
https://github.com/dongbo811/AFFormer.
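As a rough illustration of how a frequency-domain filter can stand in for quadratic self-attention, the sketch below applies a learnable filter to feature maps in the Fourier domain. This is a hypothetical PyTorch example, not the authors' adaptive frequency filter: the module name, shapes, and initialization are illustrative assumptions, and FFT-based mixing costs $O(n \log n)$ rather than the $O(n)$ claimed for the paper's block (see the linked repository for the actual AFFormer code).

```python
import torch
import torch.nn as nn


class FrequencyTokenMixer(nn.Module):
    """Hypothetical sketch of frequency-domain token mixing: filter feature
    maps with a learnable per-frequency weight instead of computing an
    O(n^2) attention matrix. This is NOT the AFFormer block; see
    https://github.com/dongbo811/AFFormer for the authors' implementation."""

    def __init__(self, dim: int, height: int, width: int):
        super().__init__()
        # One complex filter weight per channel and frequency bin
        # (rfft2 keeps width // 2 + 1 frequency columns).
        self.weight = nn.Parameter(torch.randn(dim, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, height, width) high-resolution pixel features.
        freq = torch.fft.rfft2(x, dim=(-2, -1), norm="ortho")
        freq = freq * torch.view_as_complex(self.weight)  # adaptive filtering
        return torch.fft.irfft2(freq, s=x.shape[-2:], dim=(-2, -1), norm="ortho")


# Usage: mix a 64-channel, 32x32 feature map without any attention matrix.
x = torch.randn(2, 64, 32, 32)
y = FrequencyTokenMixer(dim=64, height=32, width=32)(x)
print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Because the filter acts on all spatial frequencies at once, every output position receives global context without ever materializing an $n \times n$ attention map, which is the trade the abstract describes.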
Related papers
- MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation [0.12499537119440242]
Skin cancer segmentation poses a significant challenge in medical image analysis.
MobileUNETR aims to overcome the performance constraints associated with both CNNs and Transformers.
MobileUNETR achieves superior performance with 3 million parameters and a computational complexity of 1.3 GFLOPs.
arXiv Detail & Related papers (2024-09-04T20:23:37Z) - Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends about 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z) - ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38$\times$ higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - Low-Resolution Self-Attention for Semantic Segmentation [96.81482872022237]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure; a minimal sketch of the LRSA idea appears after this list.
arXiv Detail & Related papers (2023-10-08T06:10:09Z) - SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z) - MUSTER: A Multi-scale Transformer-based Decoder for Semantic Segmentation [19.83103856355554]
MUSTER is a transformer-based decoder that seamlessly integrates with hierarchical encoders.
MSKA units enable the fusion of multi-scale features from the encoder and decoder, facilitating comprehensive information integration.
On the challenging ADE20K dataset, our best model achieves a single-scale mIoU of 50.23 and a multi-scale mIoU of 51.88.
arXiv Detail & Related papers (2022-11-25T06:51:07Z) - Efficiently Scaling Transformer Inference [8.196193683641582]
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
arXiv Detail & Related papers (2022-11-09T18:50:38Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [79.646577541655]
We present SegFormer, a semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
SegFormer comprises a novel hierarchically structured encoder that outputs multiscale features.
The proposed decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations.
arXiv Detail & Related papers (2021-05-31T17:59:51Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z) - CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs [41.43273142203345]
We harness the flexibility of FPGAs to develop a novel object detection pipeline with deformable convolutions.
With our high-efficiency implementation, our solution reaches 26.9 frames per second with a tiny model size of 0.76 MB.
Our model reaches 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters, which is 20.9x smaller but 10% more accurate than Tiny-YOLO.
arXiv Detail & Related papers (2020-06-12T17:56:47Z)
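The Low-Resolution Self-Attention entry above is sketched below in PyTorch as a hedged illustration: queries stay at the input resolution while keys and values are pooled to a fixed small grid, so the attention cost grows linearly with the number of pixels. The module name, pooling size, and the choice to pool only keys and values are assumptions for illustration, not the LRFormer authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowResSelfAttention(nn.Module):
    """Hypothetical sketch of low-resolution self-attention: keys/values are
    pooled to a fixed pool_size x pool_size grid, so attention cost scales
    linearly with the number of input pixels. Not the LRFormer code."""

    def __init__(self, dim: int, pool_size: int = 16, num_heads: int = 4):
        super().__init__()
        self.pool_size = pool_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)               # (b, h*w, c) full-res queries
        kv = F.adaptive_avg_pool2d(x, self.pool_size)  # (b, c, p, p) fixed low-res
        kv = kv.flatten(2).transpose(1, 2)             # (b, p*p, c)
        out, _ = self.attn(q, kv, kv)                  # cost ~ (h*w) * p*p, not (h*w)^2
        return out.transpose(1, 2).reshape(b, c, h, w)


# Usage: the attention cost per pixel stays fixed whatever the input resolution.
y = LowResSelfAttention(dim=32)(torch.randn(1, 32, 128, 96))
print(y.shape)  # torch.Size([1, 32, 128, 96])
```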