Deeper Inside Deep ViT
- URL: http://arxiv.org/abs/2508.04181v1
- Date: Wed, 06 Aug 2025 08:08:04 GMT
- Title: Deeper Inside Deep ViT
- Authors: Sungrae Hong,
- Abstract summary: We examine how ViT structure reacts and train in a local environment.<n>We also highlight the instability in training and make some model modifications to stabilize it.<n>We propose an image generation architecture using ViT and investigate which between ViT and ViT-22B is a more suitable structure for image generation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There have been attempts to create large-scale structures in vision models similar to LLM, such as ViT-22B. While this research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. Therefore, we examine how this model structure reacts and train in a local environment. We also highlight the instability in training and make some model modifications to stabilize it. The ViT-22B model, trained from scratch, overall outperformed ViT in terms of performance under the same parameter size. Additionally, we venture into the task of image generation, which has not been attempted in ViT-22B. We propose an image generation architecture using ViT and investigate which between ViT and ViT-22B is a more suitable structure for image generation.
Related papers
- A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis [9.687982148528187]
Convolutional Neural Networks (CNNs) are currently among the best texture analysis approaches.
Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition.
This work explores various pre-trained ViT architectures when transferred to tasks that rely on textures.
arXiv Detail & Related papers (2024-06-10T09:48:13Z) - Scaling Vision Transformers to 22 Billion Parameters [140.67853929168382]
Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been scaled to nearly the same degree.
We present a recipe for highly efficient and stable training of a 22B- parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model.
ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
arXiv Detail & Related papers (2023-02-10T18:58:21Z) - What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z) - When Adversarial Training Meets Vision Transformers: Recipes from
Training to Architecture [32.260596998171835]
Adrial training is still required for ViTs to defend against such adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://versa.com/mo666666/When-Adrial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework for structural pruning of both ViTs and its variants, namely UP-ViTs.
Our method focuses on pruning all ViTs components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - Vision Transformer Architecture Search [64.73920718915282]
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves $74.7%$ top-$1$ accuracy on ImageNet and is $2.5%$ superior than the current baseline ViT architecture.
arXiv Detail & Related papers (2021-06-25T15:39:08Z) - Multi-Scale Vision Longformer: A New Vision Transformer for
High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of citedosovitskiy 2020image for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolution neural networks (CNNs)that can be improved by stacking more convolutional layers, the performance of ViTs saturate fast when scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.