GSB: Group Superposition Binarization for Vision Transformer with
Limited Training Samples
- URL: http://arxiv.org/abs/2305.07931v4
- Date: Thu, 18 Jan 2024 08:22:14 GMT
- Title: GSB: Group Superposition Binarization for Vision Transformer with
Limited Training Samples
- Authors: Tian Gao, Cheng-Zhong Xu, Le Zhang, Hui Kong
- Abstract summary: Vision Transformer (ViT) has performed remarkably in various computer vision tasks.
ViT usually suffers from serious overfitting problems with a relatively limited number of training samples.
We propose a novel model binarization technique, called Group Superposition Binarization (GSB).
- Score: 46.025105938192624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has performed remarkably in various computer
vision tasks. Nonetheless, owing to its massive number of parameters, ViT
usually suffers from serious overfitting when training samples are relatively
limited. In addition, ViT generally demands heavy computing resources, which
limits its deployment on resource-constrained devices. As a type of
model-compression method, model binarization is potentially a good choice to
solve both problems. Compared with its full-precision counterpart, a binarized
model replaces complex tensor multiplications with simple bit-wise binary
operations and represents full-precision parameters and activations with 1-bit
values, which addresses the problems of model size and computational
complexity, respectively. In this paper, we investigate a binarized ViT model.
Empirically, we observe that existing binarization techniques designed for
Convolutional Neural Networks (CNNs) do not transfer well to the binarization
of ViTs. We also find that the accuracy drop of the binarized ViT model stems
mainly from information loss in the Attention module and the Value vectors.
Therefore, we propose a novel model binarization technique, called Group
Superposition Binarization (GSB), to deal with these issues. To further improve
the performance of the binarized model, we investigate the gradient calculation
procedure in the binarization process and derive more suitable gradient
calculation equations for GSB to reduce the influence of gradient mismatch.
Knowledge distillation is then introduced to alleviate the performance
degradation caused by model binarization. Analytically, model binarization can
limit the parameter search space during parameter updates while training a
model....
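The name "Group Superposition" suggests approximating a full-precision tensor with a superposition (sum) of several scaled binary tensors. The sketch below is an illustrative reading of that idea, not the authors' exact GSB formulation: each binary group is fit to the residual left by the previous ones, and the scale `alpha = mean(|w|)` follows the common XNOR-Net-style choice.

```python
import numpy as np

def binarize(w):
    """1-bit quantization: alpha * sign(w), with alpha = mean(|w|)
    (XNOR-Net-style scaling; an assumption, not the paper's exact scheme)."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

def superposition_binarize(w, n_groups=3):
    """Approximate w as a sum of n_groups scaled binary tensors,
    each one binarizing the residual left by the previous groups."""
    approx = np.zeros_like(w)
    for _ in range(n_groups):
        approx += binarize(w - approx)  # binarize the remaining residual
    return approx

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for k in (1, 2, 3):
    err = np.linalg.norm(w - superposition_binarize(w, k)) / np.linalg.norm(w)
    print(f"groups={k}  relative error={err:.3f}")
```

Each additional group strictly reduces the approximation error, which is one way a grouped scheme can recover accuracy lost to plain 1-bit binarization.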
Related papers
- Research on Personalized Compression Algorithm for Pre-trained Models Based on Homomorphic Entropy Increase [2.6513322539118582]
We explore the challenges and evolution of two key technologies in the current field of AI: the Vision Transformer model and the Large Language Model (LLM).
The Vision Transformer captures global information by splitting images into small patches, but its high parameter count and compute overhead limit deployment on mobile devices.
LLMs have revolutionized natural language processing, but they also face huge deployment challenges.
arXiv Detail & Related papers (2024-08-16T11:56:49Z) - LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition [4.375744277719009]
LORTSAR is applied to two leading Transformer-based models, "Hyperformer" and "STEP-CATFormer"
Our method substantially reduces the number of model parameters with negligible degradation, or even an improvement, in recognition accuracy.
This confirms that SVD combined with post-compression fine-tuning can boost model efficiency, paving the way for more sustainable, lightweight, and high-performance technologies in human action recognition.
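The SVD-based compression LORTSAR relies on can be sketched in a few lines: factor a weight matrix into two low-rank factors via truncated SVD, so one dense layer becomes two smaller ones. This is a generic sketch of the standard truncated-SVD formulation; the rank and matrix shape here are illustrative, not taken from the paper.

```python
import numpy as np

def low_rank_compress(w, rank):
    """Split w (m x n) into two rank-r factors via truncated SVD:
    w ~= a @ b, with a: m x r and b: r x n."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # absorb singular values into the left factor
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(1)
w = rng.normal(size=(256, 256))
a, b = low_rank_compress(w, rank=32)
print(w.size, a.size + b.size)  # 65536 -> 16384: a 4x parameter reduction
```

By the Eckart-Young theorem the truncated SVD is the best rank-r approximation in the Frobenius norm, which is why a short post-compression fine-tune often recovers the remaining accuracy.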
arXiv Detail & Related papers (2024-07-19T20:19:41Z) - Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs)
Existing binarization methods result in significant performance degradation.
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z) - FBPT: A Fully Binary Point Transformer [12.373066597900127]
This paper presents a novel Fully Binary Point Cloud Transformer (FBPT) model which has the potential to be widely applied and expanded in the fields of robotics and mobile devices.
By compressing the weights and activations of a 32-bit full-precision network to 1-bit binary values, the proposed binary point cloud Transformer network significantly reduces the storage footprint and computational resource requirements.
The primary focus of this paper is on addressing the performance degradation issue caused by the use of binary point cloud Transformer modules.
arXiv Detail & Related papers (2024-03-15T03:45:10Z) - VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z) - BinaryViT: Towards Efficient and Accurate Binary Vision Transformers [4.339315098369913]
Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision fields.
As one of the most powerful compression methods, binarization reduces the computation of a neural network by quantizing its weights and activation values to ±1.
Existing binarization methods have demonstrated excellent performance on CNNs, but the full binarization of ViTs remains under-studied and suffers a significant performance drop.
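The ±1 quantization used by these binary ViTs is what enables the cheap bit-wise arithmetic mentioned in the GSB abstract: a dot product of two ±1 vectors reduces to an XNOR followed by a popcount. A small self-contained sketch, with plain Python ints standing in for hardware bit vectors (the names here are illustrative):

```python
import random

def binary_dot(x_bits, w_bits, n):
    """Dot product of two ±1 vectors packed into ints (+1 -> bit 1, -1 -> bit 0):
    agreements = popcount(XNOR), dot = 2 * agreements - n."""
    xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)  # bit set where x_i == w_i
    agreements = bin(xnor).count("1")
    return 2 * agreements - n

def pack(v):
    """Pack a ±1 list into an int, bit i = 1 iff v[i] == +1."""
    return sum(1 << i for i, b in enumerate(v) if b == 1)

# reference check against the ordinary ±1 dot product
random.seed(0)
n = 16
x = [random.choice([-1, 1]) for _ in range(n)]
w = [random.choice([-1, 1]) for _ in range(n)]
assert binary_dot(pack(x), pack(w), n) == sum(a * b for a, b in zip(x, w))
```

On real hardware the XNOR and popcount each process an entire machine word per instruction, which is the source of the speedups binarization papers report.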
arXiv Detail & Related papers (2023-05-24T05:06:59Z) - BiViT: Extremely Compressed Binary Vision Transformer [19.985314022860432]
We propose to solve two fundamental challenges to push the horizon of Binary Vision Transformers (BiViT)
We propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization.
Our method outperforms the state of the art by 19.8% on the TinyImageNet dataset.
arXiv Detail & Related papers (2022-11-14T03:36:38Z) - Learning Bounded Context-Free-Grammar via LSTM and the
Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance on a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.