Masked Structural Growth for 2x Faster Language Model Pre-training
- URL: http://arxiv.org/abs/2305.02869v3
- Date: Sat, 6 Apr 2024 06:18:26 GMT
- Title: Masked Structural Growth for 2x Faster Language Model Pre-training
- Authors: Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang
- Abstract summary: We focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one.
In terms of growth schedule, the impact of each individual dimension on a schedule's efficiency is under-explored in existing work.
We propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators.
- Score: 18.276784451675603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerating large language model pre-training is a critical issue in present research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each individual dimension on a schedule's efficiency is under-explored in existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance. Code is publicly available at https://github.com/cofe-ai/MSG.
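The name "Masked Structural Growth" points to how strict function preservation is achieved: each newly grown structure is gated by a mask that is zero at the moment of growth, so the enlarged network computes exactly the same function as the small one regardless of how the new weights are initialized, and the mask can then be raised toward one during subsequent training. Below is a minimal sketch of this idea for widening a feed-forward layer, assuming a PyTorch setting; `GrowableFFN`, `grow_hidden`, and `ramp_mask` are illustrative names, not the API of the released repository.

```python
# Minimal sketch of mask-based, strictly function-preserving width growth.
# This is an illustration of the idea described in the abstract, not the
# authors' implementation.
import torch
import torch.nn as nn


class GrowableFFN(nn.Module):
    """Feed-forward block whose hidden width can be grown during training."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)
        # One mask value per hidden unit; units already present stay at 1.
        self.register_buffer("mask", torch.ones(d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Newly added units are multiplied by 0 right after growth, so the
        # output equals the pre-growth output no matter how the new weights
        # were initialized (strict function preservation).
        h = torch.relu(self.w_in(x)) * self.mask
        return self.w_out(h)

    @torch.no_grad()
    def grow_hidden(self, new_d_hidden: int) -> None:
        """Widen the hidden layer; newly added units start fully masked."""
        old = self.w_in.out_features
        assert new_d_hidden > old
        device = self.w_in.weight.device
        w_in = nn.Linear(self.w_in.in_features, new_d_hidden).to(device)
        w_out = nn.Linear(new_d_hidden, self.w_out.out_features).to(device)
        # Copy the old weights; the new rows/columns keep an arbitrary init.
        w_in.weight[:old] = self.w_in.weight
        w_in.bias[:old] = self.w_in.bias
        w_out.weight[:, :old] = self.w_out.weight
        w_out.bias.copy_(self.w_out.bias)
        self.w_in, self.w_out = w_in, w_out
        mask = torch.zeros(new_d_hidden, device=device)
        mask[:old] = 1.0
        self.mask = mask

    @torch.no_grad()
    def ramp_mask(self, step: int, total_steps: int) -> None:
        """Gradually raise the masks of the new units from 0 toward 1."""
        self.mask[self.mask < 1.0] = min(1.0, step / total_steps)
```

Growth along other dimensions (layers, attention heads, embedding width) can follow the same pattern: copy the existing weights, leave the new ones arbitrarily initialized, and mask the new structure's contribution to zero at the moment of growth.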
Related papers
- Overcoming Growth-Induced Forgetting in Task-Agnostic Continual Learning [9.91929539637026]
In continual learning (CL), model growth enhances adaptability to new data and improves knowledge retention across more tasks.
However, improper model growth can severely degrade previously learned knowledge, especially in task-agnostic CL that uses the entire grown model for inference.
This paper presents a novel SparseGrow approach to overcome growth-induced forgetting (GIFt) while enhancing adaptability to new data.
arXiv Detail & Related papers (2024-08-20T06:05:52Z) - Landscape-Aware Growing: The Power of a Little LAG [49.897766925371485]
We study the question of how to select the best growing strategy from a given pool of growing strategies.
We present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)"
arXiv Detail & Related papers (2024-06-04T16:38:57Z) - On the Scalability of GNNs for Molecular Graphs [7.402389334892391]
Graph Neural Networks (GNNs) are yet to show the benefits of scale due to the lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures.
We analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs.
For the first time, we observe that GNNs benefit tremendously from the increasing scale of depth, width, number of molecules, number of labels, and the diversity in the pretraining datasets.
arXiv Detail & Related papers (2024-04-17T17:11:31Z) - TaE: Task-aware Expandable Representation for Long Tail Class Incremental Learning [42.630413950957795]
We introduce a novel Task-aware Expandable (TaE) framework to learn diverse representations from each incremental task.
TaE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-02-08T16:37:04Z) - GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training [78.63699436330165]
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks.
Online data grow constantly, underscoring the importance of pre-trained models being able to learn from continuously growing data.
We propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input.
arXiv Detail & Related papers (2023-08-22T10:07:49Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of the smaller model to an initialization of the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch (a minimal sketch of such a linear growth operator appears after this list).
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z) - On the Transformer Growth for Progressive BERT Training [37.57617077192438]
We find that, similar to network architecture search, Transformer growth also favors compound scaling.
In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively.
arXiv Detail & Related papers (2020-10-23T17:44:59Z)
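For the LiGO entry above, here is a hedged sketch of a learned linear growth operator applied to a single weight matrix, assuming a PyTorch setting; the class name, shapes, and training details are illustrative assumptions (the paper additionally factorizes the map across width and depth).

```python
# A hedged sketch (not the LiGO paper's code) of a learned linear growth
# operator: a pretrained small weight matrix is expanded to a larger one by
# trainable expansion matrices applied on each side. The expansion matrices
# would themselves be optimized briefly before regular training continues.
import torch
import torch.nn as nn


class LinearGrowthOperator(nn.Module):
    def __init__(self, d_small_in: int, d_small_out: int,
                 d_large_in: int, d_large_out: int):
        super().__init__()
        # Trainable maps that expand the output and input dimensions.
        self.expand_out = nn.Parameter(0.02 * torch.randn(d_large_out, d_small_out))
        self.expand_in = nn.Parameter(0.02 * torch.randn(d_small_in, d_large_in))

    def forward(self, w_small: torch.Tensor) -> torch.Tensor:
        # (d_large_out, d_small_out) @ (d_small_out, d_small_in) @ (d_small_in, d_large_in)
        return self.expand_out @ w_small @ self.expand_in


# Usage: turn a (placeholder) pretrained 256x256 weight into a 512x512 init.
w_small = torch.randn(256, 256)
op = LinearGrowthOperator(256, 256, 512, 512)
w_large_init = op(w_small)  # shape (512, 512), initializes the larger model
```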
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.