Deep Fusion: Efficient Network Training via Pre-trained Initializations
- URL: http://arxiv.org/abs/2306.11903v3
- Date: Wed, 26 Jun 2024 12:16:57 GMT
- Title: Deep Fusion: Efficient Network Training via Pre-trained Initializations
- Authors: Hanna Mazzawi, Xavi Gonzalvo, Michael Wunder, Sammy Jerome, Benoit Dherin
- Abstract summary: We present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks.
Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements.
We validate our theoretical framework, which guides the optimal use of Deep Fusion, showing that it significantly reduces both training time and resource consumption.
- Score: 3.9146761527401424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, deep learning has made remarkable progress in a wide range of domains, with a particularly notable impact on natural language processing tasks. One of the challenges associated with training deep neural networks in the context of LLMs is the need for large amounts of computational resources and time. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. We present two notable contributions in this paper. First, we present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks. Second, we propose a theoretical framework using backward error analysis to illustrate the dynamics of mid-training network growth. Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements, maintaining or surpassing traditional training methods' performance in various NLP tasks and T5 model sizes. Finally, we validate our theoretical framework, which guides the optimal use of Deep Fusion, showing that with carefully optimized training dynamics, it significantly reduces both training time and resource consumption.
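The abstract describes growing a network mid-training from pre-trained initializations of smaller networks. As a rough illustration of one way such an initialization can work, the sketch below places two pre-trained linear layers on the block diagonal of a single wider layer; the `fuse_linear` helper and the block-diagonal rule are illustrative assumptions, not the authors' actual fusion operator.

```python
import torch
import torch.nn as nn

def fuse_linear(a: nn.Linear, b: nn.Linear) -> nn.Linear:
    """Hypothetical fusion rule: place two pre-trained linear layers on the
    block diagonal of a wider layer, so the fused network initially computes
    the concatenation of both sub-networks' outputs."""
    fused = nn.Linear(a.in_features + b.in_features,
                      a.out_features + b.out_features)
    with torch.no_grad():
        fused.weight.zero_()
        fused.weight[:a.out_features, :a.in_features] = a.weight
        fused.weight[a.out_features:, a.in_features:] = b.weight
        fused.bias[:a.out_features] = a.bias
        fused.bias[a.out_features:] = b.bias
    return fused

# Two small pre-trained layers -> one wide layer, then continue training.
small_a, small_b = nn.Linear(64, 64), nn.Linear(64, 64)
wide = fuse_linear(small_a, small_b)  # 128 -> 128, block-diagonal init
```

Training then resumes from a point that preserves what the smaller networks learned, rather than from a random initialization.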
Related papers
- Going Forward-Forward in Distributed Deep Learning [0.0]
We introduce a new approach in distributed deep learning, utilizing Geoffrey Hinton's Forward-Forward (FF) algorithm.
Unlike traditional methods that rely on forward and backward passes, the FF algorithm employs a dual forward pass strategy.
Our evaluation shows a 3.75-fold speedup on the MNIST dataset without compromising accuracy when training a four-layer network with four compute nodes.
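For context, a minimal sketch of a Forward-Forward layer in the style of Hinton's original formulation (the threshold value and goodness loss follow common FF implementations, not this paper's distributed variant):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One Forward-Forward layer, trained locally with no backward pass
    through the rest of the network."""
    def __init__(self, d_in, d_out, lr=0.03, threshold=2.0):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.opt = torch.optim.Adam(self.linear.parameters(), lr=lr)
        self.threshold = threshold

    def goodness(self, x):
        # "Goodness" = sum of squared activations.
        return self.linear(x).relu().pow(2).sum(dim=1)

    def train_step(self, x_pos, x_neg):
        # First forward pass: push goodness above the threshold for positive
        # data. Second forward pass: push it below for negative data.
        loss = (F.softplus(self.threshold - self.goodness(x_pos)) +
                F.softplus(self.goodness(x_neg) - self.threshold)).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        # Detach so the next layer trains on its own local objective.
        return (self.linear(x_pos).relu().detach(),
                self.linear(x_neg).relu().detach())
```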
arXiv Detail & Related papers (2024-03-30T16:02:53Z)
- Finding Influencers in Complex Networks: An Effective Deep Reinforcement Learning Approach [13.439099770154952]
We propose an effective reinforcement learning model that achieves superior performance over traditional influence-maximization algorithms.
Specifically, we design an end-to-end learning framework, named DREIM, that combines graph neural network algorithms as the encoder and reinforcement learning as the decoder.
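DREIM's architecture is only sketched in the abstract; the toy code below conveys the encoder-decoder shape, with degree-based message passing standing in for the learned GNN encoder and greedy selection standing in for the RL decoder (all names and choices here are illustrative):

```python
import numpy as np

def influence_scores(adj: np.ndarray, steps: int = 2) -> np.ndarray:
    """Toy stand-in for a GNN encoder: repeated neighborhood
    aggregation of node features (here, degrees)."""
    h = adj.sum(axis=1, keepdims=True).astype(float)  # degree features
    for _ in range(steps):
        h = adj @ h + h                               # aggregate + self
    return h.ravel()

def pick_influencers(adj: np.ndarray, k: int) -> list[int]:
    """Greedy stand-in for the RL decoder: take the k highest-scoring nodes."""
    return list(np.argsort(-influence_scores(adj))[:k])

adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
print(pick_influencers(adj, 2))
```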
arXiv Detail & Related papers (2023-09-09T14:19:00Z)
- Exploring Low Rank Training of Deep Neural Networks [49.18122605463354]
Training deep neural networks with low-rank factorised layers offers efficiency over unfactorised training in terms of both memory consumption and training time.
We analyse techniques that work well in practice, and through extensive ablations on models such as GPT2 we provide evidence falsifying common beliefs in the field.
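As a reference point for what low-rank training means in practice, here is a minimal factorized linear layer (a generic W ≈ UV parameterization, not the specific techniques ablated in the paper):

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Factorized replacement for nn.Linear: W ~ U @ V with rank r,
    cutting weight parameters from d_out*d_in to r*(d_in + d_out)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.V = nn.Linear(d_in, rank, bias=False)  # d_in -> r
        self.U = nn.Linear(rank, d_out)             # r -> d_out

    def forward(self, x):
        return self.U(self.V(x))

layer = LowRankLinear(1024, 1024, rank=64)  # ~8x fewer weight parameters
```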
arXiv Detail & Related papers (2022-09-27T17:43:45Z)
- Training Deep Neural Networks with Joint Quantization and Pruning of Weights and Activations [5.17729871332369]
State-of-the-art quantization techniques are currently applied to both the weights and activations of deep neural networks.
In this work, we jointly apply novel uniform quantization and unstructured pruning methods to both the weights and activations of deep neural networks during training.
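A minimal sketch of how uniform fake-quantization and unstructured magnitude pruning can be applied jointly during training, using the common straight-through-estimator trick (the paper's methods are novel; this shows only the standard baseline mechanics):

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Uniform fake-quantization with a straight-through estimator:
    round in the forward pass, pass gradients through unchanged."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = (x / scale).round().clamp(-qmax - 1, qmax) * scale
    return x + (q - x).detach()

def prune_mask(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Unstructured magnitude pruning: zero out the smallest |w| entries."""
    with torch.no_grad():
        k = max(1, int(w.numel() * sparsity))
        threshold = w.abs().flatten().kthvalue(k).values
        return (w.abs() > threshold).float()

w = torch.randn(128, 128, requires_grad=True)
w_eff = fake_quantize(w) * prune_mask(w)  # weights used in the forward pass
w_eff.sum().backward()  # gradients reach the full-precision weights
```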
arXiv Detail & Related papers (2021-10-15T16:14:36Z)
- Dynamic Sparse Training for Deep Reinforcement Learning [36.66889208433228]
We propose for the first time to dynamically train deep reinforcement learning agents with sparse neural networks from scratch.
Our approach is easy to integrate into existing deep reinforcement learning algorithms.
We evaluate our approach on OpenAI Gym continuous control tasks.
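A generic drop-and-grow mask update in the spirit of dynamic sparse training (the paper's actual schedule and growth criterion may differ; this is a minimal sketch):

```python
import torch

def drop_and_grow(weight: torch.Tensor, mask: torch.Tensor,
                  drop_frac: float = 0.1) -> torch.Tensor:
    """One dynamic-sparse-training update: drop the smallest-magnitude active
    weights, then regrow the same number at random inactive positions."""
    with torch.no_grad():
        active = mask.nonzero(as_tuple=False)     # row-major coordinates
        n_drop = int(len(active) * drop_frac)
        if n_drop == 0:
            return mask
        mags = weight[mask.bool()].abs()          # same row-major order
        drop = active[mags.argsort()[:n_drop]]
        mask[drop[:, 0], drop[:, 1]] = 0
        inactive = (mask == 0).nonzero(as_tuple=False)
        grow = inactive[torch.randperm(len(inactive))[:n_drop]]
        mask[grow[:, 0], grow[:, 1]] = 1
        weight[grow[:, 0], grow[:, 1]] = 0.0      # new connections start at zero
    return mask

w = torch.randn(64, 64)
m = (torch.rand(64, 64) < 0.1).float()            # ~90% sparse to start
m = drop_and_grow(w, m)
```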
arXiv Detail & Related papers (2021-06-08T09:57:20Z)
- Training Larger Networks for Deep Reinforcement Learning [18.193180866998333]
We show that naively increasing network capacity does not improve performance.
We propose a novel method that consists of 1) wider networks with DenseNet connection, 2) decoupling representation learning from training of RL, and 3) a distributed training method to mitigate overfitting problems.
Using this three-fold technique, we show that we can train very large networks that result in significant performance gains.
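Of the three components, the first is the easiest to sketch: below is a wide MLP with DenseNet-style connectivity, where every layer sees the concatenation of the input and all earlier layers' outputs (a generic illustration, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class DenseMLP(nn.Module):
    """Wide MLP with DenseNet-style connectivity: each layer receives the
    concatenated outputs of all previous layers plus the raw input."""
    def __init__(self, d_in, width, depth, d_out):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(d_in + i * width, width) for i in range(depth))
        self.head = nn.Linear(d_in + depth * width, d_out)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(torch.relu(layer(torch.cat(feats, dim=-1))))
        return self.head(torch.cat(feats, dim=-1))

net = DenseMLP(d_in=17, width=256, depth=4, d_out=6)
```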
arXiv Detail & Related papers (2021-02-16T02:16:54Z)
- Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks [78.47459801017959]
Sparsity can reduce the memory footprint of regular networks so that they fit on mobile devices.
We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice.
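As one concrete instance of the "remove elements and exploit sparsity" pipeline the survey covers, the sketch below magnitude-prunes a dense weight matrix and stores the survivors in a sparse format (a generic illustration):

```python
import torch

def prune_to_sparse(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Magnitude-prune a dense matrix, then keep only the surviving weights
    in COO format so storage scales with the number of nonzeros."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    pruned = torch.where(weight.abs() > threshold, weight,
                         torch.zeros_like(weight))
    return pruned.to_sparse()  # usable by sparse kernels at inference

w = torch.randn(256, 256)
w_sparse = prune_to_sparse(w)
print(w_sparse._nnz(), "of", w.numel(), "weights kept")
```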
arXiv Detail & Related papers (2021-01-31T22:48:50Z)
- Go Wide, Then Narrow: Efficient Training of Deep Thin Networks [62.26044348366186]
We propose an efficient method to train a deep thin network with a theoretical guarantee.
By training with our method, ResNet50 can outperform ResNet101, and BERT Base can be comparable with BERT Large.
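The abstract doesn't spell out the transfer step; as a heavily simplified stand-in, the sketch below trains a thin student to mimic a wider, already-trained teacher's outputs, which captures only the flavor of going wide first and then narrowing (the actual method uses a theoretically grounded, layer-wise procedure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: a wide, pre-trained teacher and a thin student.
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 128)
with torch.no_grad():
    target = teacher(x)                  # wide network's outputs as targets
loss = F.mse_loss(student(x), target)    # thin network mimics the wide one
opt.zero_grad(); loss.backward(); opt.step()
```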
arXiv Detail & Related papers (2020-07-01T23:34:35Z)
- Robust Pruning at Initialization [61.30574156442608]
There is a growing need for smaller, energy-efficient neural networks that can run machine learning applications on devices with limited computational resources.
For deep NNs, existing pruning procedures remain unsatisfactory: the resulting pruned networks can be difficult to train and, for instance, nothing prevents one layer from being fully pruned.
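The procedures analyzed here prune at initialization using a saliency score; a common example is the SNIP criterion, sketched below as context for the kind of method the paper critiques:

```python
import torch
import torch.nn as nn

def snip_scores(model, loss_fn, x, y):
    """SNIP-style saliency at initialization (one common prune-at-init
    criterion): |weight * gradient| computed from a single batch."""
    loss = loss_fn(model(x), y)
    loss.backward()
    return {name: (p * p.grad).abs()
            for name, p in model.named_parameters() if p.grad is not None}

model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
scores = snip_scores(model, nn.CrossEntropyLoss(),
                     torch.randn(16, 20), torch.randint(0, 2, (16,)))
# Weights with the lowest scores would be pruned before training begins.
```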
arXiv Detail & Related papers (2020-02-19T17:09:50Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
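Recursive local representation alignment itself is more involved than can be shown here; the sketch below conveys only the broader idea of layer-local updates that avoid a single end-to-end backward pass (an illustrative stand-in, not the paper's algorithm):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Each block trains against its own auxiliary head, so no gradient ever
# crosses a block boundary -- every update is local to one block.
blocks = nn.ModuleList([nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
                        nn.Sequential(nn.Linear(256, 256), nn.ReLU())])
heads = nn.ModuleList([nn.Linear(256, 10), nn.Linear(256, 10)])
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
for block, head, opt in zip(blocks, heads, opts):
    h_out = block(x.detach())      # detach: no cross-block gradients
    loss = F.cross_entropy(head(h_out), y)
    opt.zero_grad(); loss.backward(); opt.step()
    x = h_out                      # pass activations to the next block
```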
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
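CLARS builds on layer-wise adaptive rate scaling; for reference, the basic LARS trust ratio looks like the sketch below (this shows standard LARS, not the complete layer-wise variant the paper proposes):

```python
import torch
import torch.nn as nn

def lars_local_lr(param: torch.Tensor, trust: float = 1e-3) -> float:
    """Basic LARS trust ratio: scale each layer's learning rate by
    ||w|| / ||grad||, so layers with small weights take small steps."""
    w_norm, g_norm = param.detach().norm(), param.grad.detach().norm()
    return trust * float(w_norm / g_norm) if w_norm > 0 and g_norm > 0 else trust

model = nn.Linear(10, 10)
model(torch.randn(4, 10)).sum().backward()       # populate gradients
for p in model.parameters():
    p.data.add_(p.grad, alpha=-lars_local_lr(p))  # per-layer scaled step
```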
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.