SparseSwin: Swin Transformer with Sparse Transformer Block
- URL: http://arxiv.org/abs/2309.05224v1
- Date: Mon, 11 Sep 2023 04:03:43 GMT
- Title: SparseSwin: Swin Transformer with Sparse Transformer Block
- Authors: Krisna Pinasthika, Blessius Sheldo Putra Laksono, Riyandi Banovbi Putera Irsal, Syifa Hukma Shabiyya, Novanto Yudistira
- Abstract summary: This paper aims to reduce the number of parameters and, in turn, make the transformer more efficient.
We present the Sparse Transformer (SparTa) Block, a modified transformer block with the addition of a sparse token converter.
The proposed SparseSwin model outperforms other state-of-the-art models in image classification with accuracies of 86.96%, 97.43%, and 85.35% on the ImageNet100, CIFAR10, and CIFAR100 datasets, respectively.
- Score: 1.7243216387069678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in computer vision research have established the transformer
architecture as the state of the art in computer vision tasks. One of the known
drawbacks of the transformer architecture is its high number of parameters, which
can lead to a more complex and inefficient algorithm. This paper aims to reduce
the number of parameters and, in turn, make the transformer more efficient. We
present the Sparse Transformer (SparTa) Block, a modified transformer block with
the addition of a sparse token converter that reduces the number of tokens used.
We use the SparTa Block inside the Swin-T architecture (SparseSwin) to leverage
Swin's capability to downsample its input and reduce the number of initial tokens
to be computed. The proposed SparseSwin model outperforms other state-of-the-art
models in image classification with accuracies of 86.96%, 97.43%, and 85.35% on
the ImageNet100, CIFAR10, and CIFAR100 datasets, respectively. Despite its fewer
parameters, this result highlights the potential of a transformer architecture
using a sparse token converter with a limited number of tokens to optimize the
use of the transformer and improve its performance.
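To make the token-reduction idea concrete, here is a minimal sketch of a block that first converts the N tokens coming out of a Swin-like stage into a smaller fixed set of K tokens and then applies a standard transformer block to the reduced set. The linear token-mixing projection, the module names (SparseTokenConverter, SparTaBlock), and the example sizes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a "sparse token converter" that maps N input tokens
# to a smaller fixed set of K tokens, followed by a standard transformer block.
# The linear token-mixing projection, module names, and sizes are assumptions
# for illustration, not the paper's exact design.
import torch
import torch.nn as nn


class SparseTokenConverter(nn.Module):
    """Reduce a sequence of N tokens to K tokens with a learned projection."""

    def __init__(self, num_in_tokens: int, num_sparse_tokens: int, dim: int):
        super().__init__()
        # Mixes along the token axis: (B, N, C) -> (B, K, C)
        self.token_proj = nn.Linear(num_in_tokens, num_sparse_tokens)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.token_proj(x.transpose(1, 2)).transpose(1, 2)  # (B, K, C)
        return self.norm(x)


class SparTaBlock(nn.Module):
    """Transformer block applied to the reduced token set."""

    def __init__(self, num_in_tokens: int, num_sparse_tokens: int,
                 dim: int, num_heads: int = 8):
        super().__init__()
        self.converter = SparseTokenConverter(num_in_tokens, num_sparse_tokens, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.converter(x)                      # attention now runs on K tokens
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        x = x + self.mlp(self.norm2(x))
        return x


# Example: tokens from a Swin-like stage (e.g., 7x7 = 49 tokens of width 768)
tokens = torch.randn(2, 49, 768)
block = SparTaBlock(num_in_tokens=49, num_sparse_tokens=16, dim=768)
print(block(tokens).shape)  # torch.Size([2, 16, 768])
```

Because attention cost grows quadratically with the number of tokens, shrinking N to K before the attention step is where the efficiency gain in this sketch comes from.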
Related papers
- Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer-based methods have recently achieved great success in image inpainting.
They regard each pixel as a token, thus suffering from an information loss issue.
We propose a new transformer-based framework called "PUT".
arXiv Detail & Related papers (2024-03-31T01:20:16Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs [6.9136984255301]
We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
arXiv Detail & Related papers (2022-10-06T16:57:23Z)
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
The Swin Transformer set new records in various vision tasks by using a hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields mIoU performance comparable to state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation [58.4650849317274]
Volumetric Aggregation with Transformers (VAT) is a cost aggregation network for few-shot segmentation.
VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
arXiv Detail & Related papers (2022-07-22T04:10:30Z)
- ATS: Adaptive Token Sampling For Efficient Vision Transformers [33.297806854292155]
We introduce a differentiable parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture.
ATS empowers vision transformers by scoring and adaptively sampling significant tokens.
Our evaluations show that the proposed module improves on the state of the art by reducing the computational cost (GFLOPs) by 37% while preserving accuracy. (A simplified sketch of this score-and-select idea appears after this list.)
arXiv Detail & Related papers (2021-11-30T18:56:57Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full softmax-weighted attention and keeps only the query-key similarity. (A sketch of this idea appears after this list.)
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
- Transformer in Transformer [59.066686278998354]
We propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation.
Our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT with a similar computational cost.
arXiv Detail & Related papers (2021-02-27T03:12:16Z)
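The ATS entry above describes scoring tokens and adaptively sampling the significant ones. Below is a minimal, simplified sketch of that score-and-select idea: tokens are ranked by a significance proxy (here, feature norm, which is an assumption) and only the top-scoring ones are kept. ATS itself is differentiable, parameter-free, and samples tokens adaptively rather than taking a hard top-k; this sketch only illustrates the token-reduction effect, not the published algorithm.

```python
# Simplified illustration of score-and-select token reduction in the spirit of
# ATS. The norm-based scoring and hard top-k selection are simplifying
# assumptions, not the published (differentiable, adaptive) sampling procedure.
import torch


def select_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` highest-scoring patch tokens per image.

    tokens: (B, N, C); the first token is assumed to be the class token
    and is always kept.
    """
    cls_tok, patch_toks = tokens[:, :1], tokens[:, 1:]
    scores = patch_toks.norm(dim=-1)                    # (B, N-1) significance proxy
    idx = scores.topk(keep, dim=1).indices              # indices of kept tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    kept = patch_toks.gather(1, idx)                    # (B, keep, C)
    return torch.cat([cls_tok, kept], dim=1)            # (B, keep+1, C)


x = torch.randn(2, 197, 384)            # e.g., ViT-S tokens: 1 class + 196 patches
print(select_tokens(x, keep=98).shape)  # torch.Size([2, 99, 384])
```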
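The image-matching entry above describes a simplified decoder that keeps only the query-key similarity and drops softmax-weighted attention. The sketch below shows one plausible reading under stated assumptions: raw dot-product similarities between the tokens of an image pair are pooled directly into a matching score. The max/mean pooling and the tiny scoring head are illustrative assumptions, not the paper's actual decoder.

```python
# Illustrative reading of a "query-key similarity only" decoder: similarities
# between two images' tokens are pooled into a matching score rather than being
# turned into softmax attention weights over values. Pooling scheme and scoring
# head are assumptions for illustration.
import torch
import torch.nn as nn


class SimilarityMatcher(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.score = nn.Linear(1, 1)  # maps the pooled similarity to a match score

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (B, N, C) token features of an image pair
        q = self.q_proj(feats_a)
        k = self.k_proj(feats_b)
        sim = q @ k.transpose(1, 2) / q.size(-1) ** 0.5     # (B, N, N) similarities
        pooled = sim.amax(dim=2).mean(dim=1, keepdim=True)  # max over keys, mean over queries
        return self.score(pooled).squeeze(-1)               # (B,) matching scores


matcher = SimilarityMatcher(dim=256)
a, b = torch.randn(4, 49, 256), torch.randn(4, 49, 256)
print(matcher(a, b).shape)  # torch.Size([4])
```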