Demystifying Local Vision Transformer: Sparse Connectivity, Weight
Sharing, and Dynamic Weight
- URL: http://arxiv.org/abs/2106.04263v1
- Date: Tue, 8 Jun 2021 11:47:44 GMT
- Title: Demystifying Local Vision Transformer: Sparse Connectivity, Weight
Sharing, and Dynamic Weight
- Authors: Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu,
Jingdong Wang
- Abstract summary: Local Vision Transformer (ViT) attains state-of-the-art performance in visual recognition.
We analyze local attention as a channel-wise locally-connected layer.
We empirically observe that the models based on depth-wise convolution and the dynamic variant with lower complexity perform on par with, or sometimes slightly better than, Swin Transformer.
- Score: 114.03127079555456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformer (ViT) attains state-of-the-art performance in visual
recognition, and the variant, Local Vision Transformer, makes further
improvements. The major component in Local Vision Transformer, local attention,
performs the attention separately over small local windows. We rephrase local
attention as a channel-wise locally-connected layer and analyze it from two
network regularization manners, sparse connectivity and weight sharing, as well
as weight computation. Sparse connectivity: there is no connection across
channels, and each position is connected to the positions within a small local
window. Weight sharing: the connection weights for one position are shared
across channels or within each group of channels. Dynamic weight: the
connection weights are dynamically predicted according to each image instance.
We point out that local attention resembles depth-wise convolution and its
dynamic version in sparse connectivity. The main difference lies in weight
sharing - depth-wise convolution shares connection weights (kernel weights)
across spatial positions. We empirically observe that the models based on
depth-wise convolution and the dynamic variant with lower computation
complexity perform on par with or sometimes slightly better than Swin
Transformer, an instance of Local Vision Transformer, for ImageNet
classification, COCO object detection and ADE semantic segmentation. These
observations suggest that Local Vision Transformer takes advantage of two
regularization forms and dynamic weight to increase the network capacity.
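To make the comparison concrete, the sketch below contrasts the two weight regimes the abstract describes: a depth-wise convolution, whose per-channel kernel is learned once and shared across all spatial positions, and a bare-bones single-head local window attention, whose aggregation weights are predicted from each image instance and shared across the channels they aggregate. This is a minimal PyTorch illustration under assumed module names and sizes, not the paper's implementation (in particular, it omits multiple heads and Swin's relative position bias and shifted windows).

```python
import torch
import torch.nn as nn

class DepthWiseConv(nn.Module):
    """Depth-wise convolution: no connections across channels (sparse connectivity),
    one kernel per channel shared across all spatial positions (weight sharing),
    and weights that are fixed after training (static)."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def forward(self, x):                      # x: (B, C, H, W)
        return self.conv(x)


class LocalWindowAttention(nn.Module):
    """Single-head attention inside non-overlapping windows: each position is
    connected only to positions in its window (sparse connectivity), the
    aggregation mixes no channels, and the weights are computed from the input
    itself (dynamic weight), shared across all channels of the head."""
    def __init__(self, channels, window=7):
        super().__init__()
        self.window = window
        self.scale = channels ** -0.5
        self.qkv = nn.Linear(channels, 3 * channels)

    def forward(self, x):                      # x: (B, C, H, W), H and W divisible by window
        B, C, H, W = x.shape
        w = self.window
        # partition the feature map into (B * num_windows, w*w, C) token groups
        t = x.view(B, C, H // w, w, W // w, w)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # dynamic, per-instance weights
        attn = attn.softmax(dim=-1)
        out = attn @ v                                   # same weights reused for every channel
        # reverse the window partition back to (B, C, H, W)
        out = out.view(B, H // w, W // w, w, w, C)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


x = torch.randn(2, 32, 28, 28)
print(DepthWiseConv(32)(x).shape)          # torch.Size([2, 32, 28, 28])
print(LocalWindowAttention(32)(x).shape)   # torch.Size([2, 32, 28, 28])
```

The dynamic depth-wise variant discussed in the abstract sits between these two modules: its kernel weights are predicted from the input instance rather than learned as fixed parameters, while the convolutional connectivity pattern is kept.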
Related papers
- B-cos Alignment for Inherently Interpretable CNNs and Vision
Transformers [97.75725574963197]
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training.
We show that a sequence of B-cos transformations induces a single linear transformation that faithfully summarises the full model computations.
We show that the resulting explanations are of high visual quality and perform well under quantitative interpretability metrics.
arXiv Detail & Related papers (2023-06-19T12:54:28Z) - DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z) - Dual Complementary Dynamic Convolution for Image Recognition [13.864357201410648]
We propose a novel two-branch dual complementary dynamic convolution (DCDC) operator for convolutional neural networks (CNNs).
The DCDC operator overcomes the limitations of vanilla convolution and of most existing dynamic convolutions, which capture only spatial-adaptive features.
Experiments show that the DCDC operator based ResNets (DCDC-ResNets) significantly outperform vanilla ResNets and most state-of-the-art dynamic convolutional networks on image classification.
arXiv Detail & Related papers (2022-11-11T12:32:12Z) - B-cos Networks: Alignment is All We Need for Interpretability [136.27303006772294]
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training.
A B-cos transform induces a single linear transform that faithfully summarises the full model computations.
We show that it can easily be integrated into common models such as VGGs, ResNets, InceptionNets, and DenseNets.
arXiv Detail & Related papers (2022-05-20T16:03:29Z) - Transformer-Guided Convolutional Neural Network for Cross-View
Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone that extracts a feature map from an input image and a Transformer head that models global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z) - Boosting Salient Object Detection with Transformer-based Asymmetric
Bilateral U-Net [19.21709807149165]
Existing salient object detection (SOD) methods mainly rely on U-shaped convolution neural networks (CNNs) with skip connections.
We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net) to learn both global and local representations for SOD.
ABiU-Net performs favorably against previous state-of-the-art SOD methods.
arXiv Detail & Related papers (2021-08-17T19:45:28Z) - Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods, which do not scale well to downstream tasks that rely on larger input image resolutions, our efficient variant, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z) - LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
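As a rough illustration of the LocalViT idea above, here is a minimal sketch of a feed-forward network with a depth-wise convolution inserted between its two point-wise projections; the class name, layer sizes, and activation are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LocalityFeedForward(nn.Module):
    """Feed-forward network with a depth-wise 3x3 convolution between the two
    point-wise projections, so tokens also mix information within a local
    spatial neighbourhood (similar in spirit to an inverted residual block)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.expand = nn.Conv2d(dim, hidden_dim, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.reduce = nn.Conv2d(hidden_dim, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, tokens, H, W):           # tokens: (B, N, C) with N == H * W
        B, N, C = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, C, H, W)   # back to a 2-D feature map
        x = self.act(self.expand(x))
        x = self.act(self.dwconv(x))           # locality: depth-wise spatial mixing
        x = self.reduce(x)
        return x.reshape(B, C, N).transpose(1, 2)

tokens = torch.randn(2, 14 * 14, 64)
out = LocalityFeedForward(64, 256)(tokens, 14, 14)
print(out.shape)                               # torch.Size([2, 196, 64])
```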
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.