BinaryViT: Pushing Binary Vision Transformers Towards Convolutional
Models
- URL: http://arxiv.org/abs/2306.16678v1
- Date: Thu, 29 Jun 2023 04:48:02 GMT
- Title: BinaryViT: Pushing Binary Vision Transformers Towards Convolutional
Models
- Authors: Phuoc-Hoan Charles Le, Xinlin Li
- Abstract summary: Binarization can significantly reduce the size and computational cost of ViT models.
However, ViTs suffer a larger performance drop than CNNs when convolutional neural network (CNN) binarization methods are applied directly.
We propose BinaryViT, which, inspired by the CNN architecture, incorporates CNN-style operations into a pure ViT architecture.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing popularity and size of vision transformers
(ViTs), there has been increasing interest in making them more efficient and
less computationally costly for deployment on edge devices with limited
computing resources. Binarization can significantly reduce the size of ViT
models and their computational cost by using XNOR-popcount operations when
both the weights and the activations are binary. However, when CNN
binarization methods, or other existing binarization methods, are applied
directly, ViTs suffer a larger performance drop than CNNs on datasets with a
large number of classes such as ImageNet-1k. Through extensive analysis, we
find that binary vanilla ViTs such as DeiT lack several key architectural
properties that give binary CNNs a much higher representational capability
than binary vanilla ViTs. Therefore, we propose BinaryViT, which, inspired by
the CNN architecture, incorporates CNN-style operations into a pure ViT
architecture to enrich its representational capability without introducing
convolutions. These operations include an average pooling layer instead of a
token pooling layer, a block containing multiple average pooling branches, an
affine transformation right before the addition of each main residual
connection, and a pyramid structure. Experimental results on the ImageNet-1k
dataset show that these operations allow a binary pure-ViT model to be
competitive with previous state-of-the-art (SOTA) binary CNN models.
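The XNOR-popcount arithmetic mentioned in the abstract can be illustrated with a small sketch (an assumption-level illustration, not the paper's implementation): two {-1, +1} vectors are packed one bit per element, and their dot product reduces to an XNOR followed by a popcount.

```python
# Illustrative sketch of the XNOR-popcount trick used by binary networks.
# Assumption: weights/activations are in {-1, +1}, packed one bit per element
# (bit i = 1 encodes +1, bit i = 0 encodes -1).

def pack(vec):
    """Pack a list of +1/-1 values into an integer bitmask."""
    bits = 0
    for i, v in enumerate(vec):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1,+1} vectors from their bitmasks.
    XNOR marks positions where the signs match; each match contributes +1
    and each mismatch -1, so dot = 2 * matches - n."""
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

# Example: [1, -1, 1] . [1, 1, -1] = 1 - 1 - 1 = -1
dot = binary_dot(pack([1, -1, 1]), pack([1, 1, -1]), 3)  # -> -1
```

Hardware popcount instructions make this far cheaper than full-precision multiply-accumulate, which is the source of the savings the abstract describes.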
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation [13.753795233064695]
Most domain adaptation (DA) methods are based on either convolutional neural networks (CNNs) or vision transformers (ViTs).
We design a hybrid method, called Explicitly Class-specific Boundaries (ECB), to take full advantage of both ViTs and CNNs.
ECB learns CNN on ViT to combine their distinct strengths.
arXiv Detail & Related papers (2024-03-27T08:52:44Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that enhances the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Basic Binary Convolution Unit for Binarized Image Restoration Network [146.0988597062618]
In this study, we reconsider components in binary convolution, such as residual connection, BatchNorm, activation function, and structure, for image restoration tasks.
Based on our findings and analyses, we design a simple yet efficient basic binary convolution unit (BBCU).
Our BBCU significantly outperforms other BNNs and lightweight models, which shows that BBCU can serve as a basic unit for binarized IR networks.
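A rough illustration of such a unit (a sketch under my own simplifying assumptions: a dense matmul stands in for the convolution, and BatchNorm is omitted) combines sign-binarized inputs and weights, a computed scale, and a full-precision residual connection:

```python
import numpy as np

def bbcu_like_block(x, w):
    """Toy binary unit: binarize activations and weights with sign(),
    compute a scaled binary matmul, then add a full-precision residual.
    The residual lets the binarized path recover fine detail."""
    alpha = np.abs(w).mean()                 # per-tensor scaling factor
    out = (np.sign(x) @ np.sign(w)) * alpha  # "binary convolution" stand-in
    return out + x                           # full-precision residual

x = np.array([[1.0, -2.0]])
w = np.array([[0.5, -0.5], [1.0, 1.0]])
y = bbcu_like_block(x, w)
```

The residual connection is one of the components the paper singles out as essential for binarized image-restoration networks.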
arXiv Detail & Related papers (2022-10-02T01:54:40Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: the depth/width/resolution/patch-size dimensions can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
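Hierarchical pooling of tokens can be sketched as a strided average over the token axis (a minimal illustration with assumed shapes and stride, not HVT's actual module):

```python
import numpy as np

def pool_tokens(x, stride=2):
    """Average-pool a (seq_len, dim) token sequence to shrink its length."""
    seq_len, dim = x.shape
    trimmed = x[: seq_len - seq_len % stride]   # drop leftover tokens
    return trimmed.reshape(-1, stride, dim).mean(axis=1)

tokens = np.arange(8.0).reshape(4, 2)  # 4 tokens of dim 2
pooled = pool_tokens(tokens)           # -> 2 tokens of dim 2
```

Halving the sequence length at each stage is what shrinks the quadratic attention cost as depth grows.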
arXiv Detail & Related papers (2021-03-19T03:55:58Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT halves the parameter count and MACs of vanilla ViT, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT of comparable size to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
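The tokens-to-token idea of merging neighboring tokens can be sketched as follows (my own simplified rendering with an assumed kernel size and stride; the real T2T module interleaves this re-tokenization with attention layers):

```python
import numpy as np

def soft_split(tokens, h, w, k=3, stride=2):
    """Reshape (h*w, dim) tokens into an image-like map, then re-tokenize
    overlapping k x k neighborhoods so adjacent tokens are merged.
    Sequence length shrinks; token dimension grows by a factor of k*k."""
    dim = tokens.shape[-1]
    img = tokens.reshape(h, w, dim)
    pad = k // 2
    img = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    out = []
    for i in range(0, h + 2 * pad - k + 1, stride):
        for j in range(0, w + 2 * pad - k + 1, stride):
            out.append(img[i:i + k, j:j + k].reshape(-1))
    return np.stack(out)

tokens = np.arange(16.0).reshape(16, 1)  # 16 tokens on a 4x4 grid
merged = soft_split(tokens, h=4, w=4)    # 4 tokens of dim 9
```

Progressively aggregating tokens this way injects a locality prior that a plain patch embedding lacks.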
arXiv Detail & Related papers (2021-01-28T13:25:28Z) - Towards Lossless Binary Convolutional Neural Networks Using Piecewise
Approximation [4.023728681102073]
Binary CNNs can significantly reduce the number of arithmetic operations and the size of memory storage.
However, the accuracy degradation of single and multiple binary CNNs is unacceptable for modern architectures.
We propose a Piecewise Approximation scheme for multiple binary CNNs which lessens accuracy loss by approximating full precision weights and activations.
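A greedy multi-base approximation in this spirit (my own hedged sketch; the paper's actual piecewise scheme differs in detail) represents full-precision weights as a sum of scaled sign tensors, with reconstruction error shrinking as bases are added:

```python
import numpy as np

def multi_base_binarize(w, k=2):
    """Greedily approximate w as sum_i alpha_i * b_i with b_i in {-1, +1}.
    Each step binarizes the remaining residual (XNOR-Net-style scaling)."""
    residual = w.astype(float).copy()
    terms = []
    for _ in range(k):
        alpha = np.abs(residual).mean()          # scale for this base
        b = np.where(residual >= 0, 1.0, -1.0)   # sign tensor
        terms.append((alpha, b))
        residual -= alpha * b                    # binarize what remains
    approx = sum(a * b for a, b in terms)
    return approx, terms

w = np.array([0.5, -1.0, 0.25])
err1 = np.abs(multi_base_binarize(w, k=1)[0] - w).sum()
err2 = np.abs(multi_base_binarize(w, k=2)[0] - w).sum()  # err2 < err1
```

Each extra base costs one more popcount pass at inference, trading a little speed for accuracy.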
arXiv Detail & Related papers (2020-08-08T13:32:33Z) - NASB: Neural Architecture Search for Binary Convolutional Neural
Networks [2.3204178451683264]
We propose a strategy, named NASB, which adopts Neural Architecture Search (NAS) to find an optimal architecture for the binarization of CNNs.
Due to the flexibility of this automated strategy, the obtained architecture is not only suitable for binarization but also has low overhead.
NASB outperforms existing single and multiple binary CNNs by up to 4.0% and 1.0% Top-1 accuracy respectively.
arXiv Detail & Related papers (2020-08-08T13:06:11Z) - Binarizing MobileNet via Evolution-based Searching [66.94247681870125]
We propose using evolutionary search to facilitate the construction and training scheme when binarizing MobileNet.
Inspired by one-shot architecture search frameworks, we leverage the idea of group convolution to design efficient 1-bit Convolutional Neural Networks (CNNs).
Our objective is to come up with a tiny yet efficient binary neural architecture by exploring the best candidates of the group convolution.
arXiv Detail & Related papers (2020-05-13T13:25:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.