Performance Evaluation of Swin Vision Transformer Model using Gradient
Accumulation Optimization Technique
- URL: http://arxiv.org/abs/2308.00197v1
- Date: Mon, 31 Jul 2023 23:30:16 GMT
- Title: Performance Evaluation of Swin Vision Transformer Model using Gradient
Accumulation Optimization Technique
- Authors: Sanad Aburass and Osama Dorgham
- Abstract summary: This paper evaluates the performance of the Swin ViT model using the gradient accumulation optimization (GAO) technique.
Applying the GAO technique leads to a significant decrease in the accuracy of the Swin ViT model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have emerged as a promising approach for visual
recognition tasks, revolutionizing the field by leveraging the power of
transformer-based architectures. Among the various ViT models, Swin
Transformers have gained considerable attention due to their hierarchical
design and ability to capture both local and global visual features
effectively. This paper evaluates the performance of the Swin ViT model trained
with the gradient accumulation optimization (GAO) technique. We investigate the
impact of the GAO technique on the model's accuracy and training time. Our
experiments show that applying the GAO technique leads to a significant decrease
in the accuracy of the Swin ViT model compared to the standard Swin Transformer
model. Moreover, we observe a significant increase in the training time of the
Swin ViT model when the GAO technique is applied. These findings suggest that
the GAO technique may not be suitable for the Swin ViT model, and that caution
should be exercised when applying it to other transformer-based models.
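Since the paper does not publish its training code, the snippet below is a minimal sketch of how gradient accumulation is commonly implemented in PyTorch: gradients from several mini-batches are summed before each optimizer step, so the effective batch size is the per-step batch size times the accumulation factor. The model choice (torchvision's swin_t), the 10-class task, the synthetic data, and the accumulation factor of 4 are illustrative assumptions, not the authors' actual setup.
```python
# Minimal sketch of gradient accumulation (GAO) for a Swin Transformer in PyTorch.
# Model, dataset, and hyperparameters are illustrative assumptions, not the paper's setup.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import swin_t

# Synthetic stand-in for an image classification dataset (64 RGB images, 10 classes).
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 10, (64,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=8)

model = swin_t(weights=None, num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch size = 8 * 4 = 32

model.train()
optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = criterion(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated gradients match one large batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # update weights only every accum_steps mini-batches
        optimizer.zero_grad()
```
Note that as accum_steps grows, the effective batch size increases while the number of optimizer updates per epoch shrinks, which is the usual reason both accuracy and wall-clock training time can shift when gradient accumulation is introduced.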
Related papers
- Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry [1.2289361708127877]
We propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry.
The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference.
arXiv Detail & Related papers (2024-09-13T12:21:25Z)
- TransAxx: Efficient Transformers with Approximate Computing [4.347898144642257]
Vision Transformer (ViT) models have been shown to be very competitive and have become a popular alternative to Convolutional Neural Networks (CNNs).
We propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic.
Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations.
arXiv Detail & Related papers (2024-02-12T10:16:05Z)
- VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z)
- ViT2EEG: Leveraging Hybrid Pretrained Vision Transformers for EEG Data [0.0]
We demonstrate the application of a hybrid Vision Transformer (ViT) model, pretrained on ImageNet, on an electroencephalogram (EEG) regression task.
This model shows a notable increase in performance compared to other models, including an identical architecture ViT trained without the ImageNet weights.
arXiv Detail & Related papers (2023-08-01T11:10:33Z)
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z)
- Depth Estimation with Simplified Transformer [4.565830918989131]
Transformer and its variants have shown state-of-the-art results in many vision tasks recently.
We propose a method for self-supervised monocular Depth Estimation with a Simplified Transformer (DEST).
Our model leads to a significant reduction in model size, complexity, and inference latency, while achieving superior accuracy compared to the state of the art.
arXiv Detail & Related papers (2022-04-28T21:39:00Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.