Recurrent Stacking of Layers in Neural Networks: An Application to
Neural Machine Translation
- URL: http://arxiv.org/abs/2106.10002v1
- Date: Fri, 18 Jun 2021 08:48:01 GMT
- Title: Recurrent Stacking of Layers in Neural Networks: An Application to
Neural Machine Translation
- Authors: Raj Dabre and Atsushi Fujita
- Abstract summary: We propose to share parameters across all layers, thereby leading to a recurrently stacked neural network model.
We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 layers where each layer has different parameters.
- Score: 18.782750537161615
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In deep neural network modeling, the most common practice is to stack a
number of recurrent, convolutional, or feed-forward layers in order to obtain
high-quality continuous space representations, which in turn improve the
quality of the network's predictions. Conventionally, each layer in the stack
has its own parameters, which leads to a significant increase in the number of
model parameters. In this paper, we propose to share parameters across all
layers, thereby leading to a recurrently stacked neural network model. We report
on an extensive case study on neural machine translation (NMT), where we apply
our proposed method to an encoder-decoder based neural network model, i.e., the
Transformer model, and experiment with three Japanese--English translation
datasets. We empirically demonstrate that the translation quality of a model
that recurrently stacks a single layer 6 times, despite having significantly
fewer parameters, approaches that of a model that stacks 6 layers where each
layer has different parameters. We also explore the limits of recurrent
stacking where we train extremely deep NMT models. This paper also examines the
utility of our recurrently stacked model as a student model through transfer
learning via leveraging pre-trained parameters and knowledge distillation, and
shows that it compensates for the drop in translation quality that directly
training a recurrently stacked model brings. We also show how transfer learning
enables faster decoding, on top of the reduction in parameters that recurrent
stacking already provides. Finally, we analyze the effects of
recurrently stacked layers by visualizing the attentions of models that use
recurrently stacked layers and models that do not.
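To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' code; the layer sizes and the use of torch.nn.TransformerEncoderLayer are illustrative assumptions) of an encoder in which a single layer is recurrently stacked, i.e., the same parameters are reused at every depth step instead of instantiating six distinct layers.

```python
# Minimal sketch of recurrent stacking: one shared Transformer encoder layer
# applied num_recurrences times, so every depth step reuses the same parameters.
# Hyperparameters and the choice of nn.TransformerEncoderLayer are illustrative.
import torch
import torch.nn as nn


class RecurrentlyStackedEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, dim_feedforward=2048,
                 num_recurrences=6, dropout=0.1):
        super().__init__()
        # A single layer instead of num_recurrences independently parameterized layers.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward, dropout, batch_first=True)
        self.num_recurrences = num_recurrences

    def forward(self, x, src_key_padding_mask=None):
        # Reapply the same layer (same weights) at every depth step.
        for _ in range(self.num_recurrences):
            x = self.shared_layer(x, src_key_padding_mask=src_key_padding_mask)
        return x


# Toy usage: a batch of 4 source sentences of length 10, embedding size 512.
encoder = RecurrentlyStackedEncoder()
hidden = encoder(torch.randn(4, 10, 512))
print(hidden.shape)  # torch.Size([4, 10, 512])
print(sum(p.numel() for p in encoder.parameters()))  # roughly one layer's worth of weights
```

Because only one layer's weights are stored, the parameter count stays close to that of a 1-layer model while the computation depth matches a 6-layer model, which mirrors the trade-off reported in the paper.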
Related papers
- Variational autoencoder-based neural network model compression [4.992476489874941]
Variational Autoencoders (VAEs), as a form of deep generative model, have been widely used in recent years.
This paper aims to explore neural network model compression method based on VAE.
arXiv Detail & Related papers (2024-08-25T09:06:22Z) - Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z) - Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers or groups of layers.
arXiv Detail & Related papers (2023-07-13T09:39:10Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
The results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - NAR-Former: Neural Architecture Representation Learning towards Holistic
Attributes Prediction [37.357949900603295]
We propose a neural architecture representation model that can be used to estimate attributes holistically.
Experiment results show that our proposed framework can be used to predict the latency and accuracy attributes of both cell architectures and whole deep neural networks.
arXiv Detail & Related papers (2022-11-15T10:15:21Z) - Learning to Learn with Generative Models of Neural Network Checkpoints [71.06722933442956]
We construct a dataset of neural network checkpoints and train a generative model on the parameters.
We find that our approach successfully generates parameters for a wide range of loss prompts.
We apply our method to different neural network architectures and tasks in supervised and reinforcement learning.
arXiv Detail & Related papers (2022-09-26T17:59:58Z) - Entropy optimized semi-supervised decomposed vector-quantized
variational autoencoder model based on transfer learning for multiclass text
classification and generation [3.9318191265352196]
We propose a semi-supervised discrete latent variable model for multi-class text classification and text generation.
The proposed model employs the concept of transfer learning for training a quantized transformer model.
Experimental results indicate that the proposed model notably outperforms state-of-the-art models.
arXiv Detail & Related papers (2021-11-10T07:07:54Z) - Train your classifier first: Cascade Neural Networks Training from upper
layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z) - On the Sparsity of Neural Machine Translation Models [65.49762428553345]
We investigate whether redundant parameters can be reused to achieve better performance.
Experiments and analyses are systematically conducted on different datasets and NMT architectures.
arXiv Detail & Related papers (2020-10-06T11:47:20Z)