Masked Structural Growth for 2x Faster Language Model Pre-training
- URL: http://arxiv.org/abs/2305.02869v3
- Date: Sat, 6 Apr 2024 06:18:26 GMT
- Title: Masked Structural Growth for 2x Faster Language Model Pre-training
- Authors: Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang
- Abstract summary: We focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one.
In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work.
We propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators.
- Score: 18.276784451675603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerating large language model pre-training is a critical issue in current research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance. Code is publicly available at https://github.com/cofe-ai/MSG.
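The core idea behind a strictly function-preserving growth operator can be illustrated with a small sketch. The following is a minimal, hypothetical PyTorch example, not taken from the authors' released code: the module, method names, and growth procedure are illustrative assumptions. A feed-forward hidden dimension is widened, the new units keep an arbitrary random initialization, and a mask zeroes out their contribution, so the grown model computes exactly the same function as before growth; ramping the mask up during training is only indicated in a comment.

```python
# A minimal sketch of a mask-based, strictly function-preserving width-growth
# operator in the spirit of MSG. NOT the authors' implementation: module,
# names, and growth procedure are illustrative assumptions.
import torch
import torch.nn as nn


class GrowableFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Mask over hidden units: 1.0 for existing units, 0.0 for newly grown ones.
        self.register_buffer("mask", torch.ones(d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(x)) * self.mask  # masked-out units contribute nothing
        return self.fc2(h)

    @torch.no_grad()
    def grow_hidden(self, new_d_hidden: int) -> None:
        old = self.fc1.out_features
        fc1 = nn.Linear(self.fc1.in_features, new_d_hidden)
        fc2 = nn.Linear(new_d_hidden, self.fc2.out_features)
        # Copy the old weights; the new rows/columns keep their (arbitrary)
        # random initialization -- the zero mask makes the operator strictly
        # function-preserving regardless of how they are initialized.
        fc1.weight[:old] = self.fc1.weight
        fc1.bias[:old] = self.fc1.bias
        fc2.weight[:, :old] = self.fc2.weight
        fc2.bias.copy_(self.fc2.bias)
        self.fc1, self.fc2 = fc1, fc2
        new_mask = torch.zeros(new_d_hidden, device=self.mask.device)
        new_mask[:old] = self.mask
        self.register_buffer("mask", new_mask)
        # During subsequent training, the zero entries of `mask` would be
        # gradually ramped up to 1.0 so the new capacity starts to contribute.


# Sanity check: the output is unchanged immediately after growth.
x = torch.randn(4, 16)
ffn = GrowableFFN(d_model=16, d_hidden=32)
y_before = ffn(x)
ffn.grow_hidden(64)
assert torch.allclose(y_before, ffn(x), atol=1e-6)
```

The sketch covers only one growth dimension (feed-forward width); per the abstract, MSG's schedules involve all possible dimensions, and the actual operators and schedules are in the paper and the repository at https://github.com/cofe-ai/MSG.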
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation [9.920563105290894]
Cogito is a neurobiologically inspired multi-agent framework that enhances problem-solving capabilities in code generation tasks at lower cost.
Cogito accumulates knowledge and cognitive skills at each stage, ultimately forming a Super Role, an all-capable agent that performs the code generation task.
arXiv Detail & Related papers (2025-01-30T01:41:44Z)
- Overcoming Growth-Induced Forgetting in Task-Agnostic Continual Learning [9.91929539637026]
In continual learning (CL), model growth enhances adaptability to new data and improves knowledge retention for more tasks.
However, improper model growth can severely degrade previously learned knowledge, especially in task-agnostic CL that uses the entire grown model for inference.
This paper presents a novel SparseGrow approach to overcome growth-induced forgetting (GIFt) while enhancing adaptability to new data.
arXiv Detail & Related papers (2024-08-20T06:05:52Z)
- Landscape-Aware Growing: The Power of a Little LAG [49.897766925371485]
We study the question of how to select the best growing strategy from a given pool of growing strategies.
We present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)".
arXiv Detail & Related papers (2024-06-04T16:38:57Z)
- TaE: Task-aware Expandable Representation for Long Tail Class Incremental Learning [42.630413950957795]
We introduce a novel Task-aware Expandable (TaE) framework to learn diverse representations from each incremental task.
TaE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-02-08T16:37:04Z)
- GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training [78.63699436330165]
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks.
Online data are growing constantly, highlighting the importance of a pre-trained model's ability to learn from continuously growing data.
We propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input.
arXiv Detail & Related papers (2023-08-22T10:07:49Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to 3x -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
- On the Transformer Growth for Progressive BERT Training [37.57617077192438]
We find that, similar to network architecture search, Transformer growth also favors compound scaling.
In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively.
arXiv Detail & Related papers (2020-10-23T17:44:59Z)