GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys,
and Values
- URL: http://arxiv.org/abs/2311.03426v2
- Date: Wed, 13 Dec 2023 16:57:19 GMT
- Title: GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys,
and Values
- Authors: Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan
Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad,
Kangling Liu, Yang Liu
- Abstract summary: GQKVA is designed to speed up transformer pre-training while reducing the model size.
Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size.
- Score: 3.960622297616708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Massive transformer-based models face several challenges, including slow and
computationally intensive pre-training and over-parametrization. This paper
addresses these challenges by proposing a versatile method called GQKVA, which
generalizes query, key, and value grouping techniques. GQKVA is designed to
speed up transformer pre-training while reducing the model size. Our
experiments with various GQKVA variants highlight a clear trade-off between
performance and model size, allowing for customized choices based on resource
and time limitations. Our findings also indicate that the conventional
multi-head attention approach is not always the best choice, as there are
lighter and faster alternatives available. We tested our method on ViT, which
achieved an approximate 0.3% increase in accuracy while reducing the model size
by about 4% in the task of image classification. Additionally, our most
aggressive model reduction experiment resulted in a reduction of approximately
15% in model size, with only around a 1% drop in accuracy.
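For context, the sketch below shows the head-grouping idea that GQKVA builds on: several query heads share a smaller set of key/value heads, which shrinks the projection matrices and therefore the model. It is a minimal PyTorch illustration rather than the paper's implementation; the class name GroupedAttention, the group counts, and the dimensions are assumptions, and the specific query/key/value grouping variants studied in GQKVA are defined in the paper itself.

# Minimal sketch of grouped attention: several query heads share a smaller
# set of key/value heads. Names, group counts, and dimensions are
# illustrative assumptions, not taken from the GQKVA paper.
import torch
import torch.nn.functional as F
from torch import nn


class GroupedAttention(nn.Module):
    def __init__(self, d_model=256, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q_heads, self.n_kv_heads = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        # Using fewer K/V projections than Q projections is where the
        # parameter savings come from.
        self.q_proj = nn.Linear(d_model, n_q_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.out_proj = nn.Linear(n_q_heads * self.head_dim, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Broadcast each K/V head to the group of query heads that shares it.
        group_size = self.n_q_heads // self.n_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)


x = torch.randn(2, 16, 256)          # (batch, tokens, d_model)
print(GroupedAttention()(x).shape)   # torch.Size([2, 16, 256])

Setting n_kv_heads equal to n_q_heads recovers conventional multi-head attention, while reducing it toward 1 yields progressively lighter and faster variants, illustrating the kind of size/accuracy trade-off the abstract describes.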
Related papers
- VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning [3.256420760342604]
We propose VTrans, an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle.
Our method compresses all structural components, including embeddings, attention heads, and layers using VIB-trained masks.
Notably, our method achieves up to 70% more compression than prior state-of-the-art approaches.
arXiv Detail & Related papers (2024-06-07T22:07:46Z)
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% over previous methods at equivalent model cost.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z)
- GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks, but their large model sizes and high computation costs make practical deployment difficult.
To mitigate this problem, structured pruning is a promising solution to compress model size and enable practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
- Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy but perceptually insignificant compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z)
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
arXiv Detail & Related papers (2022-03-10T17:03:49Z)
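As a small illustration of the weight-averaging idea in the model soups entry above, the sketch below builds a "uniform soup" by averaging the parameters of several checkpoints that share one architecture. This is my hedged example, not the authors' recipe; the helper name uniform_soup and the toy Linear modules standing in for fine-tuned checkpoints are assumptions.

# Minimal sketch of uniform weight averaging ("model soup"); an illustration,
# not the authors' code. Toy Linear modules stand in for checkpoints
# fine-tuned from the same initialization.
import copy
import torch


def uniform_soup(models):
    """Average the parameters of models that share a single architecture."""
    soup = copy.deepcopy(models[0])
    state_dicts = [m.state_dict() for m in models]
    averaged = {
        name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
        for name in state_dicts[0]
    }
    soup.load_state_dict(averaged)
    return soup


# Toy usage: three "fine-tuned" models collapse into one averaged model whose
# inference cost is identical to that of any single member.
models = [torch.nn.Linear(4, 2) for _ in range(3)]
print(uniform_soup(models).weight)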
- Multi-Dimensional Model Compression of Vision Transformer [21.8311401851523]
Vision transformers (ViT) have recently attracted considerable attention, but the huge computational cost remains an issue for practical deployment.
Previous ViT pruning methods tend to prune the model along only one dimension.
We advocate a multi-dimensional ViT compression paradigm, and propose to harness the redundancy reduction from attention head, neuron and sequence dimensions jointly.
arXiv Detail & Related papers (2021-12-31T19:54:18Z)
- AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models [4.247712017691596]
AxFormer is a framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task.
Our experiments show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models.
arXiv Detail & Related papers (2020-10-07T23:29:34Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences.