Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models
- URL: http://arxiv.org/abs/2403.09635v2
- Date: Thu, 18 Jul 2024 17:59:35 GMT
- Title: Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models
- Authors: Akhil Kedia, Mohd Abbas Zaidi, Sushil Khyalia, Jungho Jung, Harshith Goka, Haejun Lee
- Abstract summary: We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
- Score: 6.809572275782338
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In spite of their huge success, transformer models remain difficult to scale in depth. In this work, we develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores. We also propose DeepScaleLM, an initialization and scaling scheme that conserves unit output/gradient moments throughout the model, enabling the training of very deep models with 1000 layers. We find that transformer models could be much deeper - our deep models with fewer parameters outperform shallow models in Language Modeling, Speech Translation, and Image Classification, across encoder-only, decoder-only and encoder-decoder variants, for both Pre-LN and Post-LN transformers, for multiple datasets and model sizes. These improvements also translate into improved performance on downstream Question Answering tasks and improved robustness for Image Classification.
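The abstract does not reproduce the DeepScaleLM formulae here, but the flavour of such a scheme can be illustrated with a hedged sketch: a Pre-LN block whose residual-branch output projections are shrunk by 1/sqrt(2N) at initialization, so that summing the N attention and N MLP branch contributions keeps activations and gradients near unit variance. The scaling constant, module layout, and names below are illustrative assumptions, not the paper's exact prescription.

```python
import math
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Pre-LN transformer block whose residual-branch output projections are
    down-scaled at init so a deep stack keeps roughly unit activation variance.
    The 1/sqrt(2*num_layers) factor is an illustrative choice, not the paper's
    exact DeepScaleLM formula."""

    def __init__(self, d_model: int, n_heads: int, num_layers: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        scale = 1.0 / math.sqrt(2.0 * num_layers)  # 2N residual branches in total
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)  # attention output projection
            self.mlp[2].weight.mul_(scale)         # second MLP projection

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x

# Example: a few blocks configured as if they belonged to a 1000-layer model;
# the output standard deviation stays close to the input's instead of growing.
blocks = nn.Sequential(*[ScaledResidualBlock(256, 4, 1000) for _ in range(8)])
x = torch.randn(2, 16, 256)
print(blocks(x).std())
```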
Related papers
- Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers [0.21756081703276003]
This thesis provides methods and analysis of models which make progress on this goal.
We introduce two new finetuning methods which add new capabilities to the models they are used on.
We provide theoretical and empirical insights on the divergence of model-likelihood and output quality.
arXiv Detail & Related papers (2024-08-29T03:50:24Z) - Freely Long-Thinking Transformer (FraiLT) [0.0]
Freely Long-Thinking Transformer (FraiLT) is an improved transformer model designed to enhance processing capabilities without scaling up size.
FraiLT utilizes a recursion approach, iterating over a subset of layers multiple times, and introduces iteration encodings to maintain awareness across these cycles.
When evaluated on a synthetic story dataset, FraiLT outperformed larger models, showcasing its ability to deliver high-quality performance while reducing memory demands.
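The summary only names the mechanism, so here is a non-authoritative guess at what layer recursion with iteration encodings could look like: a small stack of layers is applied several times, with a learned per-cycle embedding added before each pass. All names and the placement of the embedding are assumptions, not the published FraiLT implementation.

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """Applies the same small stack of layers `n_iters` times, adding a learned
    iteration embedding before each pass so the model can tell cycles apart.
    Illustrative guess at the FraiLT idea, not the published implementation."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_iters=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.iter_emb = nn.Embedding(n_iters, d_model)  # one vector per cycle
        self.n_iters = n_iters

    def forward(self, x):                      # x: (batch, seq, d_model)
        for i in range(self.n_iters):
            x = x + self.iter_emb.weight[i]    # mark which cycle this is
            for layer in self.layers:
                x = layer(x)
        return x
```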
arXiv Detail & Related papers (2024-01-21T23:37:33Z) - I3D: Transformer architectures with input-dependent dynamic depth for speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
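As a rough, non-authoritative illustration of input-dependent depth (not the paper's exact gating), the wrapper below pools the input, predicts a gate in [0, 1], and softly interpolates between skipping and executing a layer; hard, sampled gates would be the discrete analogue.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """Wraps an encoder layer with an input-dependent gate: g close to 0 skips
    the layer, g close to 1 runs it fully. Illustrative sketch only."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 1)               # per-utterance gate logit

    def forward(self, x):                               # x: (batch, frames, d_model)
        g = torch.sigmoid(self.gate(x.mean(dim=1)))     # (batch, 1), pooled over time
        g = g.unsqueeze(1)
        return (1 - g) * x + g * self.layer(x)
```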
arXiv Detail & Related papers (2023-03-14T04:47:00Z) - Regularizing Transformers With Deep Probabilistic Layers [62.997667081978825]
In this work, we demonstrate how the inclusion of deep generative models within BERT can bring more versatile models.
We prove its effectiveness not only in Transformers but also in the most relevant encoder-decoder based LM, seq2seq, with and without attention.
arXiv Detail & Related papers (2021-08-23T10:17:02Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z) - Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z) - TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representation extraction or fine-tuned with downstream models.
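The three alteration axes are only named in the summary, so the snippet below is a hedged sketch of what such corruptions might look like on a (frames, mel-bins) spectrogram: a time mask, a frequency-channel mask, and magnitude noise. Span sizes and the noise level are illustrative assumptions; a reconstruction loss against the unaltered input would then drive pre-training.

```python
import torch

def alter_spectrogram(x, time_span=20, freq_span=8, noise_std=0.1):
    """Corrupt a (frames, mel_bins) spectrogram along three axes: time masking,
    frequency-channel masking, and magnitude noise. Parameters are illustrative
    assumptions, not the paper's settings."""
    x = x.clone()
    t0 = torch.randint(0, max(1, x.shape[0] - time_span), (1,)).item()
    x[t0:t0 + time_span, :] = 0.0                   # time alteration
    f0 = torch.randint(0, max(1, x.shape[1] - freq_span), (1,)).item()
    x[:, f0:f0 + freq_span] = 0.0                   # frequency (channel) alteration
    return x + noise_std * torch.randn_like(x)      # magnitude alteration
```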
arXiv Detail & Related papers (2020-07-12T16:19:00Z) - Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded positions along the position index with a continuous dynamical system.
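A worked, non-authoritative sketch of that idea follows: a small learned vector field is integrated along the position index to produce the encoding at each position. The fixed-step Euler scheme and the network shape are assumptions for illustration; the paper's continuous-time formulation and solver differ.

```python
import torch
import torch.nn as nn

class DynamicalPositionEncoding(nn.Module):
    """Builds position encodings by integrating a learned vector field along the
    position index with fixed-step Euler updates. A minimal sketch of the
    'continuous dynamical system' idea, not the paper's implementation."""

    def __init__(self, d_model=256, hidden=128):
        super().__init__()
        self.p0 = nn.Parameter(torch.zeros(d_model))        # encoding at position 0
        self.field = nn.Sequential(                          # dp/dt = field([p, t])
            nn.Linear(d_model + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, seq_len: int, steps_per_pos: int = 4):
        encs, p = [], self.p0
        dt = 1.0 / steps_per_pos
        for pos in range(seq_len):
            encs.append(p)
            for k in range(steps_per_pos):                   # Euler steps to the next position
                t = torch.tensor([pos + k * dt])
                p = p + dt * self.field(torch.cat([p, t]))
        return torch.stack(encs)                             # (seq_len, d_model)
```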
arXiv Detail & Related papers (2020-03-13T00:41:41Z)