Composable Function-preserving Expansions for Transformer Architectures
- URL: http://arxiv.org/abs/2308.06103v1
- Date: Fri, 11 Aug 2023 12:27:22 GMT
- Title: Composable Function-preserving Expansions for Transformer Architectures
- Authors: Andrea Gesmundo and Kaitlin Maile
- Abstract summary: Training state-of-the-art neural networks requires a high cost in terms of compute and time.
We propose six composable transformations to incrementally increase the size of transformer-based neural networks.
- Score: 2.579908688646812
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training state-of-the-art neural networks requires a high cost in terms of
compute and time. Model scale is recognized to be a critical factor to achieve
and improve the state-of-the-art. Increasing the scale of a neural network
normally requires restarting from scratch by randomly initializing all the
parameters of the model, as this implies a change of architecture's parameters
that does not allow for a straightforward transfer of knowledge from smaller
size models. In this work, we propose six composable transformations to
incrementally increase the size of transformer-based neural networks while
preserving functionality, allowing the capacity of the model to be expanded as
needed. We provide proof of exact function preservation under minimal
initialization constraints for each transformation. The proposed methods may
enable efficient training pipelines for larger and more powerful models by
progressively expanding the architecture throughout training.
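As a rough illustration of what "function preservation under an initialization constraint" can mean, the sketch below widens the hidden layer of a small two-layer MLP. It is not the authors' implementation and does not reproduce any of the six transformations as specified in the paper; the helper names (`affine`, `mlp`, `expand_hidden`) and the choice to zero-initialize the new output weights are assumptions made for this example. The new hidden units get arbitrary input weights, but their outgoing weights start at zero, so the network computes the same output before and after the expansion.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, x):
    # W is a list of rows; returns W @ x + b.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp(x, W1, b1, W2, b2):
    # Two-layer MLP: affine -> ReLU -> affine.
    return affine(W2, b2, relu(affine(W1, b1, x)))

def expand_hidden(W1, b1, W2, extra, rng):
    """Hypothetical helper: add `extra` hidden units without changing the output.

    New rows of W1 are randomly initialized (no constraint needed);
    new columns of W2 are zero, so the new units contribute nothing yet.
    """
    d_in = len(W1[0])
    new_W1 = [row[:] for row in W1] + [
        [rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(extra)
    ]
    new_b1 = list(b1) + [0.0] * extra
    new_W2 = [row[:] + [0.0] * extra for row in W2]
    return new_W1, new_b1, new_W2

rng = random.Random(0)
d_in, hidden, d_out = 3, 4, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(hidden)]
b1 = [rng.gauss(0, 1) for _ in range(hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(d_out)]
b2 = [rng.gauss(0, 1) for _ in range(d_out)]
x = [0.5, -1.0, 2.0]

big_W1, big_b1, big_W2 = expand_hidden(W1, b1, W2, extra=3, rng=rng)
before = mlp(x, W1, b1, W2, b2)
after = mlp(x, big_W1, big_b1, big_W2, b2)
assert before == after  # the expanded network computes the same function
```

Only the outgoing weights of the new units are constrained here. Their incoming weights are free, which lets the added capacity begin receiving gradient as soon as training resumes.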
Related papers
- Autoregressive + Chain of Thought = Recurrent: Recurrence's Role in Language Models' Computability and a Revisit of Recurrent Transformer [29.970200877158764]
We investigate the influence of recurrent structures in neural models on their reasoning abilities and computability.
We shed light on how the CoT approach can mimic recurrent computation and act as a bridge between autoregression and recurrence.
arXiv Detail & Related papers (2024-09-14T00:30:57Z)
- Principled Architecture-aware Scaling of Hyperparameters [69.98414153320894]
Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process.
In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture.
We demonstrate that network rankings can be easily changed by better training networks in benchmarks.
arXiv Detail & Related papers (2024-02-27T11:52:49Z)
- Symplectic Autoencoders for Model Reduction of Hamiltonian Systems [0.0]
It is crucial to preserve the symplectic structure associated with the system in order to ensure long-term numerical stability.
We propose a new neural network architecture in the spirit of autoencoders, which are established tools for dimension reduction.
In order to train the network, a non-standard gradient descent approach is applied.
arXiv Detail & Related papers (2023-12-15T18:20:25Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- PDSketch: Integrated Planning Domain Programming and Learning [86.07442931141637]
We present a new domain definition language, named PDSketch.
It allows users to flexibly define high-level structures in the transition models.
Details of the transition model will be filled in by trainable neural networks.
arXiv Detail & Related papers (2023-03-09T18:54:12Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction [37.357949900603295]
We propose a neural architecture representation model that can be used to estimate attributes holistically.
Experiment results show that our proposed framework can be used to predict the latency and accuracy attributes of both cell architectures and whole deep neural networks.
arXiv Detail & Related papers (2022-11-15T10:15:21Z)
- GradMax: Growing Neural Networks using Gradient Information [22.986063120002353]
We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics.
We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.
arXiv Detail & Related papers (2022-01-13T18:30:18Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.