ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars
for Write Noise Mitigation
- URL: http://arxiv.org/abs/2402.02586v1
- Date: Sun, 4 Feb 2024 19:04:37 GMT
- Title: ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars
for Write Noise Mitigation
- Authors: Abhiroop Bhattacharjee, Abhishek Moitra, and Priyadarshini Panda
- Abstract summary: In-memory computing (IMC) crossbars based on Non-volatile Memories (NVMs) have emerged as a promising solution for accelerating transformers.
We find pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the impact of dynamically generated write noise.
We propose ClipFormer, a transformation on the Key and Value matrices during inference, to boost the non-ideal accuracies of pre-trained ViT models on memristive crossbars.
- Score: 6.853523674099236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have revolutionized various real-world applications from natural
language processing to computer vision. However, the traditional von-Neumann
computing paradigm faces memory and bandwidth limitations in accelerating
transformers owing to their massive model sizes. To this end, In-memory
Computing (IMC) crossbars based on Non-volatile Memories (NVMs), due to their
ability to perform highly parallelized Matrix-Vector-Multiplications (MVMs)
with high energy-efficiencies, have emerged as a promising solution for
accelerating transformers. However, analog MVM operations in crossbars
introduce non-idealities, such as stochastic read & write noise, which affect
the inference accuracy of the deployed transformers. Specifically, we find
pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the
impact of write noise on the dynamically-generated Key (K) and Value (V)
matrices in the attention layers, an effect not accounted for in prior studies.
We, thus, propose ClipFormer, a transformation on the K and V matrices during
inference, to boost the non-ideal accuracies of pre-trained ViT models.
ClipFormer requires no additional hardware or training overhead and is
amenable to transformers deployed on any memristive crossbar platform. Our
experiments on the ImageNet-1k dataset using pre-trained DeiT-S transformers,
subjected to standard training and variation-aware training, show >10-40%
higher non-ideal accuracies in the high write noise regime when ClipFormer is
applied.
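As a rough illustration of the idea (no code is given here), the sketch below clips the dynamically generated Key and Value activations to a bounded range before they are written onto a simulated crossbar; the clipping rule, the clip_ratio parameter, and the multiplicative write-noise model are hypothetical stand-ins rather than the authors' exact formulation.

```python
import torch

def clip_kv(x: torch.Tensor, clip_ratio: float = 3.0) -> torch.Tensor:
    """Clip a dynamically generated K or V matrix to +/- clip_ratio * std.

    Hypothetical stand-in for ClipFormer's K/V transformation: bounding the
    dynamic range before the values are programmed onto the crossbar limits
    the relative impact of stochastic write noise on outlier entries.
    """
    bound = float(clip_ratio * x.std())
    return x.clamp(-bound, bound)

def noisy_attention(q, k, v, write_noise_std=0.05, use_clipformer=True):
    """Single-head attention with a toy multiplicative write-noise model on K and V."""
    if use_clipformer:
        k, v = clip_kv(k), clip_kv(v)
    # Toy model of crossbar write noise: perturbation proportional to the stored value.
    k = k * (1 + write_noise_std * torch.randn_like(k))
    v = v * (1 + write_noise_std * torch.randn_like(v))
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```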
Related papers
- MABViT -- Modified Attention Block Enhances Vision Transformers [0.0]
We propose a novel transformer variant that integrates non-linearity within the attention block.
We implement the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset.
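Beyond the sentence above, the exact design is not given here; a minimal sketch of what a GLU-style gate on the Value tensor inside a standard attention head could look like follows (the gate projection and its placement are assumptions).

```python
import torch
import torch.nn as nn

class GLUValueAttention(nn.Module):
    """Single-head attention where the Value projection is gated GLU-style.

    Sketch only: the exact placement of the non-linearity in MABViT may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.v_gate = nn.Linear(dim, dim)  # assumed gating branch

    def forward(self, x):
        q, k = self.q(x), self.k(x)
        v = self.v(x) * torch.sigmoid(self.v_gate(x))  # GLU: value * sigmoid(gate)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```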
arXiv Detail & Related papers (2023-12-03T09:00:31Z)
- Optimizing ViViT Training: Time and Memory Reduction for Action Recognition [30.431334125903145]
We address the challenges posed by the substantial training time and memory consumption associated with video transformers.
Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training.
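In PyTorch terms, freezing the spatial transformer typically amounts to disabling gradients for that sub-module, as in this hypothetical sketch (the spatial_transformer attribute name is an assumption):

```python
import torch.nn as nn

def freeze_spatial_transformer(video_model: nn.Module) -> None:
    """Freeze the spatial transformer of a factorised video transformer.

    Assumes the model exposes a `spatial_transformer` sub-module (name is
    hypothetical); only the temporal transformer and head keep training.
    """
    spatial = video_model.spatial_transformer
    for p in spatial.parameters():
        p.requires_grad = False
    spatial.eval()  # switch off dropout and stop batch-norm statistic updates
```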
arXiv Detail & Related papers (2023-06-07T23:06:53Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Remote Sensing Change Detection With Transformers Trained from Scratch [62.96911491252686]
Transformer-based change detection (CD) approaches either employ a model pre-trained on the large-scale ImageNet image classification dataset or first pre-train on another CD dataset and then fine-tune on the target benchmark.
We develop an end-to-end CD approach with transformers that is trained from scratch and yet achieves state-of-the-art performance on four public benchmarks.
arXiv Detail & Related papers (2023-04-13T17:57:54Z)
- Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization [31.28396970291575]
Efficient transformers, leveraging techniques such as sparse and linear attention and hashing tricks, have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade accuracy.
We first interpret the linear attention and residual connections in computing the attention map as gradient descent steps.
We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities.
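One plausible reading of that description is sketched below: a heavy-ball momentum term applied to the running key-value state of causal linear attention; the paper's exact update may differ.

```python
import torch
import torch.nn.functional as F

def momentum_linear_attention(q, k, v, beta: float = 0.9, eps: float = 1e-6):
    """Causal linear attention with a heavy-ball momentum term on the running state.

    q, k, v: (seq_len, dim). Feature map phi = elu(x) + 1, as in standard
    linear attention. The momentum buffer `m` is an illustrative reading of
    the summary above, not necessarily the paper's formulation.
    """
    phi = lambda x: F.elu(x) + 1
    q, k = phi(q), phi(k)
    state = torch.zeros(k.shape[-1], v.shape[-1])  # running sum of phi(k_t) v_t^T
    norm = torch.zeros(k.shape[-1])                # running sum of phi(k_t)
    m = torch.zeros_like(state)                    # momentum buffer
    outputs = []
    for t in range(q.shape[0]):
        m = beta * m + torch.outer(k[t], v[t])     # momentum on the state increment
        state = state + m
        norm = norm + k[t]
        outputs.append((q[t] @ state) / (q[t] @ norm + eps))
    return torch.stack(outputs)
```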
arXiv Detail & Related papers (2022-08-01T02:37:49Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9x speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing [7.890230091463883]
Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval.
We propose Vision-language Transformer Decomposing (VLDeformer), which converts the VL transformer into an individual encoder for a single image or text.
arXiv Detail & Related papers (2021-10-20T09:00:51Z)
- Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
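Interpreting the title literally, a sketch of attention scores computed from a mixture of Gaussian kernels over several candidate keys per position is given below; the mixture size, shared variance, and mixing weights are assumptions.

```python
import torch

def mgk_attention_scores(q, keys, log_pi, sigma: float = 1.0):
    """Attention scores from a mixture of Gaussian keys (illustrative sketch).

    q:      (n, d)      queries
    keys:   (m, c, d)   c candidate keys per key position (c = mixture size)
    log_pi: (m, c)      log mixing weights per position
    Returns (n, m) unnormalised scores: log sum_j pi_j * N(q; k_j, sigma^2 I).
    """
    # Squared distances between every query and every candidate key: (n, m, c)
    diff = q[:, None, None, :] - keys[None, :, :, :]
    sq_dist = (diff ** 2).sum(-1)
    # Log of the mixture of isotropic Gaussian kernels.
    return torch.logsumexp(log_pi[None] - sq_dist / (2 * sigma ** 2), dim=-1)
```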
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
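One way such parameter sharing across scales can be realized (a sketch under that assumption, not necessarily the paper's scheme) is to let a sub-Transformer reuse a slice of each full weight matrix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceableLinear(nn.Linear):
    """Linear layer whose leading rows/columns also serve a narrower sub-model.

    Illustrative reading of 'sub-Transformers of different scales with shared
    parameters': a sub-Transformer reuses the top-left block of the full weight
    matrix, so every scale shares a single parameter set.
    """

    def forward_sub(self, x: torch.Tensor, d_in: int, d_out: int) -> torch.Tensor:
        return F.linear(x, self.weight[:d_out, :d_in], self.bias[:d_out])

# Usage: the full layer and the narrower sub-layer share the same weights.
layer = SliceableLinear(512, 512)
x_small = torch.randn(4, 256)
y_small = layer.forward_sub(x_small, d_in=256, d_out=256)
```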
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.