Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
- URL: http://arxiv.org/abs/2507.18405v1
- Date: Thu, 24 Jul 2025 13:45:48 GMT
- Title: Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
- Authors: Simin Huo, Ning Li
- Abstract summary: We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer. Iwin Transformer can be fine-tuned directly from low to high resolution through the collaboration of innovative interleaved window attention and depthwise separable convolution. Iwin Transformer exhibits strong competitiveness in tasks such as image classification, semantic segmentation, and video action recognition.
- Score: 1.6893365456432206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module and overcoming Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4% top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.
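To make the mechanism concrete: the abstract pairs attention over interleaved windows (connecting distant tokens) with a depthwise separable convolution (connecting neighboring tokens). Below is a minimal PyTorch sketch of our reading of that idea, reconstructed from the abstract alone; the class name, the stride, and every implementation detail are assumptions rather than the authors' code (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn

class InterleavedWindowAttention(nn.Module):
    """Hypothetical sketch: attention over interleaved (strided) token
    groups plus a depthwise separable convolution for local mixing.
    Reconstructed from the abstract; not the authors' implementation."""

    def __init__(self, dim: int, num_heads: int = 4, stride: int = 4):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise 3x3
            nn.Conv2d(dim, dim, 1),                         # pointwise 1x1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W assumed divisible by the stride
        B, C, H, W = x.shape
        s = self.stride
        # interleaved partition: window (i, j) gathers grid positions
        # (i::s, j::s), so each window's members are spatially far apart
        t = x.reshape(B, C, H // s, s, W // s, s)
        t = t.permute(0, 3, 5, 2, 4, 1).reshape(B * s * s, (H // s) * (W // s), C)
        t, _ = self.attn(t, t, t)                # attention connects distant tokens
        t = t.reshape(B, s, s, H // s, W // s, C)
        t = t.permute(0, 5, 3, 1, 4, 2).reshape(B, C, H, W)
        return t + self.local(t)                 # convolution links neighbors

# e.g. InterleavedWindowAttention(96)(torch.randn(1, 96, 56, 56)) keeps (B, C, H, W)
```

Because the window partition is defined by a stride rather than by absolute positions, the module needs no position embedding and applies unchanged at higher resolutions, which is consistent with the abstract's claim of direct low-to-high-resolution fine-tuning.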
Related papers
- DuoFormer: Leveraging Hierarchical Representations by Local and Global Attention Vision Transformer [1.456352735394398]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process, preserving the inherited multi-scale inductive biases.
arXiv Detail & Related papers (2025-06-15T22:42:57Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and negligible extra computational cost (a simplified sketch follows this entry).
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
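As a rough illustration of the quadrangle idea above, the sketch below predicts one affine transform per window and resamples features before window attention would run. It is a hypothetical simplification of QA: the paper learns a more general quadrangle (projective) formulation, and its sampling is not confined to the default window as it is here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrangleWindowSampler(nn.Module):
    """Hypothetical sketch: one learned affine transform per window, used
    to resample that window's features before attention. Simplified from
    the summary above; the actual QA module is projective and can sample
    beyond the default window."""

    def __init__(self, dim: int, window: int = 7):
        super().__init__()
        self.window = window
        self.to_theta = nn.Linear(dim, 6)  # 6 affine parameters per window
        # start from the identity transform, i.e. plain square windows
        nn.init.zeros_(self.to_theta.weight)
        self.to_theta.bias.data.copy_(
            torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W assumed divisible by the window size
        B, C, H, W = x.shape
        w = self.window
        nh, nw = H // w, W // w
        # predict each window's transform from its average-pooled features
        pooled = F.avg_pool2d(x, w)                        # (B, C, nh, nw)
        theta = self.to_theta(pooled.flatten(2).transpose(1, 2))
        theta = theta.reshape(B * nh * nw, 2, 3)
        # cut the map into windows and move them into the batch dimension
        win = x.reshape(B, C, nh, w, nw, w).permute(0, 2, 4, 1, 3, 5)
        win = win.reshape(B * nh * nw, C, w, w)
        # resample every window under its own transform
        grid = F.affine_grid(theta, win.shape, align_corners=False)
        win = F.grid_sample(win, grid, align_corners=False)
        # window attention would run on `win` here; then restore the layout
        win = win.reshape(B, nh, nw, C, w, w).permute(0, 3, 1, 4, 2, 5)
        return win.reshape(B, C, H, W)
```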
- Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations [36.57766081271396]
A Vision Transformer has a larger receptive field than a CNN and can characterize long-range dependencies.
To boost efficiency, window-based Vision Transformers have emerged.
We examine the necessity of Swin Transformer's key design element, the shifted window partitioning.
arXiv Detail & Related papers (2022-11-25T17:36:20Z)
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose simplifying self-attention with a convolutional modulation operation (a minimal sketch follows this entry).
arXiv Detail & Related papers (2022-11-22T01:39:45Z)
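The convolutional modulation named above replaces the attention map with the output of a large-kernel depthwise convolution, which gates a linear value branch via an elementwise product. The sketch below reflects our reading of the summary; the kernel size and exact layer layout are assumptions.

```python
import torch
import torch.nn as nn

class ConvModulation(nn.Module):
    """Hypothetical sketch of convolutional modulation: a large-kernel
    depthwise convolution produces per-position weights that gate a value
    branch, standing in for the softmax attention matrix."""

    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.a = nn.Sequential(                   # modulation branch
            nn.Conv2d(dim, dim, 1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size,
                      padding=kernel_size // 2, groups=dim),  # depthwise
        )
        self.v = nn.Conv2d(dim, dim, 1)           # value branch
        self.proj = nn.Conv2d(dim, dim, 1)        # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # elementwise product replaces the attention-weighted sum
        return self.proj(self.a(x) * self.v(x))
```

Unlike self-attention, the cost here grows linearly with the number of tokens, which is the efficiency argument the summary alludes to.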
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small-labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token (a simplified sketch of cross-shaped window attention follows this entry).
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
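The cross-shaped windows of the title split the trade-off above down the middle: half of the channels attend within horizontal stripes and half within vertical stripes, so their union covers a cross-shaped region at far less than global cost. The module below is a hypothetical simplification; the actual CSWin also widens stripes across stages and adds a locally-enhanced positional encoding.

```python
import torch
import torch.nn as nn

class CrossShapedAttention(nn.Module):
    """Hypothetical sketch of cross-shaped window attention: channels are
    split in two, one half attends inside horizontal stripes and the other
    inside vertical stripes."""

    def __init__(self, dim: int, num_heads: int = 4, stripe: int = 2):
        super().__init__()
        assert dim % 4 == 0 and num_heads % 2 == 0
        self.stripe = stripe
        self.h_attn = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)
        self.v_attn = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)

    def _stripe_attn(self, x: torch.Tensor, attn: nn.MultiheadAttention) -> torch.Tensor:
        # attention within horizontal stripes of height `stripe`
        B, C, H, W = x.shape
        s = self.stripe
        t = x.reshape(B, C, H // s, s, W).permute(0, 2, 3, 4, 1)
        t = t.reshape(B * (H // s), s * W, C)
        t, _ = attn(t, t, t)
        t = t.reshape(B, H // s, s, W, C).permute(0, 4, 1, 2, 3)
        return t.reshape(B, C, H, W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W assumed divisible by the stripe width
        xh, xv = x.chunk(2, dim=1)
        xh = self._stripe_attn(xh, self.h_attn)
        # vertical stripes are horizontal stripes of the transposed map
        xv = self._stripe_attn(xv.transpose(2, 3), self.v_attn).transpose(2, 3)
        return torch.cat([xh, xv], dim=1)
```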
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves the Vision Transformer (ViT) in both performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)