Dynamic Layer Tying for Parameter-Efficient Transformers
- URL: http://arxiv.org/abs/2401.12819v1
- Date: Tue, 23 Jan 2024 14:53:20 GMT
- Title: Dynamic Layer Tying for Parameter-Efficient Transformers
- Authors: Tamir David Hay, Lior Wolf
- Abstract summary: We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, the memory consumption during training is up to one order of magnitude less than with the conventional training method.
- Score: 65.268245109828
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the pursuit of reducing the number of trainable parameters in deep
transformer networks, we employ Reinforcement Learning to dynamically select
layers during training and tie them together. Every few iterations, the RL
agent is asked whether to train each layer $i$ independently or to copy the
weights of a previous layer $j<i$. This facilitates weight sharing, reduces the
number of trainable parameters, and also serves as an effective regularization
technique. Experimental evaluations validate that our model modestly
outperforms the baseline transformer model with regard to perplexity and
drastically reduces the number of trainable parameters. In particular, the
memory consumption during training is up to one order of magnitude less than with
the conventional training method.
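The abstract describes a per-layer action space in which layer $i$ either keeps its own weights or is tied to an earlier layer $j<i$. Below is a minimal PyTorch sketch of that tying mechanism; the toy model, the retie_layers helper, and the hard-coded action vector are illustrative assumptions standing in for the paper's RL agent, which the abstract says is queried every few iterations.

```python
import torch
import torch.nn as nn


class TinyTransformerLM(nn.Module):
    """Toy transformer language model (causal masking omitted for brevity)."""

    def __init__(self, vocab=1000, d_model=128, n_heads=4, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        h = self.embed(x)
        for block in self.blocks:
            h = block(h)
        return self.head(h)


def retie_layers(model, canonical_blocks, actions):
    """actions[i] == i keeps layer i's own (canonical) weights; actions[i] == j < i
    ties layer i to layer j, so the two positions share one parameter set."""
    for i, j in enumerate(actions):
        model.blocks[i] = canonical_blocks[i] if j == i else model.blocks[j]


model = TinyTransformerLM()
canonical = list(model.blocks)       # untied modules, kept so layers can be untied again
actions = [0, 0, 0, 3, 3, 3, 6, 6]   # hypothetical RL decisions for 8 layers
retie_layers(model, canonical, actions)

# Tied positions reuse one module object, so their parameters are counted and
# updated once; here only 3 of the 8 blocks remain distinct.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```

Because tied positions share a single module object, PyTorch deduplicates their parameters, which is where the reduction in trainable parameters comes from in this sketch.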
Related papers
- ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation [0.6281017402518722]
ElastiFormer adapts pretrained Transformer models into elastic counterparts with variable inference-time compute.
Routing modules are trained with self-distillation losses that minimize the difference between the outputs of the pretrained model and its elastic counterpart.
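A minimal sketch of one plausible self-distillation objective for such routing modules, assuming the pretrained model is kept frozen as the teacher and the elastic (routed) model is the student; the function name, the KL formulation, and the temperature are illustrative assumptions rather than ElastiFormer's exact loss.

```python
import torch.nn.functional as F


def routing_self_distillation_loss(pretrained_logits, elastic_logits, temperature=1.0):
    """Push the elastic (routed) model's output distribution toward the frozen
    pretrained model's; the teacher logits are detached so gradients only flow
    into the trainable (routing) parameters of the elastic path."""
    teacher = F.softmax(pretrained_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(elastic_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
```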
arXiv Detail & Related papers (2024-11-22T16:11:14Z)
- On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers [47.77328392236625]
State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts.
We introduce a two-stage training procedure: we first optimize the task-specific parameters and then train the classifier using the same selection procedure that is applied at inference time (see the sketch below).
Our method achieves results that are either superior or on par with the state of the art while being computationally cheaper.
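As referenced above, a hedged sketch of what the first stage might look like, assuming the task-specific parameters are the LayerNorm affine weights of an otherwise frozen Vision Transformer; the helper name and the parameter choice are assumptions for illustration, not the paper's exact procedure.

```python
import torch.nn as nn


def layernorm_parameters(backbone):
    """Freeze a ViT backbone and return only its LayerNorm affine parameters,
    one hedged reading of the "task-specific parameters" mentioned above."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    params = []
    for m in backbone.modules():
        if isinstance(m, nn.LayerNorm) and m.elementwise_affine:
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            params += [m.weight, m.bias]
    return params

# Stage 1: optimize layernorm_parameters(backbone) on the current task.
# Stage 2: keep them fixed and train the classifier, selecting the per-task
# parameters with the same procedure that is used at inference time.
```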
arXiv Detail & Related papers (2023-08-18T15:11:16Z)
- Decision S4: Efficient Sequence-Based RL via State Spaces Layers [87.3063565438089]
We present an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model.
We also present an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism.
arXiv Detail & Related papers (2023-06-08T13:03:53Z)
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated on two natural language understanding tasks, showing up to a $63\times$ compression ratio, little accuracy loss, and remarkable inference and training speedups.
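As a rough illustration of the layer-by-layer distillation mentioned above, the snippet below matches each compressed student layer's hidden states to the corresponding teacher layer's with a per-layer MSE term; the exact loss, layer mapping, and quantization-aware terms in the paper may differ.

```python
import torch.nn.functional as F


def layerwise_distillation_loss(student_hiddens, teacher_hiddens):
    """Average of per-layer MSE terms between the quantized/tensor-compressed
    student's hidden states and the frozen pretrained teacher's hidden states."""
    assert len(student_hiddens) == len(teacher_hiddens)
    losses = [F.mse_loss(s, t.detach()) for s, t in zip(student_hiddens, teacher_hiddens)]
    return sum(losses) / len(losses)
```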
arXiv Detail & Related papers (2023-06-01T18:32:08Z)
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
Parameter-efficient transfer learning (PETL) techniques update only a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements far more substantially by training a small side network and avoiding backpropagation through the backbone.
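A minimal sketch of the ladder-side idea under stated assumptions: the backbone is frozen and run without gradient tracking, while a small trainable side network consumes its intermediate activations. The class name, layer shapes, and simple linear side blocks are placeholders, not LST's actual architecture.

```python
import torch
import torch.nn as nn


class LadderSideNetwork(nn.Module):
    """Small trainable side path fed by a frozen backbone's intermediate
    activations; no gradients are stored for the backbone itself."""

    def __init__(self, backbone_layers, d_backbone, d_side, n_classes):
        super().__init__()
        self.backbone_layers = nn.ModuleList(backbone_layers)   # frozen, large
        for p in self.backbone_layers.parameters():
            p.requires_grad_(False)
        self.downsamplers = nn.ModuleList(
            nn.Linear(d_backbone, d_side) for _ in self.backbone_layers
        )
        self.side_blocks = nn.ModuleList(
            nn.Linear(d_side, d_side) for _ in self.backbone_layers
        )
        self.head = nn.Linear(d_side, n_classes)

    def forward(self, h):
        side = torch.zeros(*h.shape[:-1], self.head.in_features, device=h.device)
        for layer, down, block in zip(self.backbone_layers, self.downsamplers, self.side_blocks):
            with torch.no_grad():                     # frozen backbone, no activation storage
                h = layer(h)
            side = torch.relu(block(side + down(h)))  # ladder connection into the side path
        return self.head(side)
```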
arXiv Detail & Related papers (2022-06-13T23:51:56Z)
- Dimensionality Reduced Training by Pruning and Freezing Parts of a Deep Neural Network, a Survey [69.3939291118954]
State-of-the-art deep learning models have a parameter count that reaches into the billions. Training, storing and transferring such models is energy and time consuming, thus costly.
Model compression lowers storage and transfer costs, and can further make training more efficient by decreasing the number of computations in the forward and/or backward pass.
This work surveys methods that reduce the number of trained weights in deep learning models throughout training.
arXiv Detail & Related papers (2022-05-17T05:37:08Z)
- Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation [18.782750537161615]
We propose to share parameters across all layers, leading to a recurrently stacked neural network model.
We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 layers where each layer has different parameters.
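The snippet below is a minimal sketch of that idea: a single transformer layer whose parameters are reused at every depth, so a six-step stack costs roughly one layer's worth of parameters. The toy class and its hyperparameters are assumptions, not the paper's NMT model.

```python
import torch.nn as nn


class RecurrentlyStackedEncoder(nn.Module):
    """One shared transformer layer applied n_stacks times in sequence."""

    def __init__(self, d_model=512, n_heads=8, n_stacks=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_stacks = n_stacks

    def forward(self, x, mask=None):
        h = x
        for _ in range(self.n_stacks):      # the same weights are used at every depth
            h = self.shared_layer(h, src_mask=mask)
        return h
```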
arXiv Detail & Related papers (2021-06-18T08:48:01Z)
- Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency [31.572652956170252]
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.
We experimentally achieve a 7.8X parameter reduction, a 41.9% training speedup, and a 37.7% inference speedup while maintaining performance comparable to conventional BERT-like self-supervised methods.
arXiv Detail & Related papers (2021-04-08T08:21:59Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin (adaptive model initialization) to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)