Masked Structural Growth for 2x Faster Language Model Pre-training
- URL: http://arxiv.org/abs/2305.02869v3
- Date: Sat, 6 Apr 2024 06:18:26 GMT
- Title: Masked Structural Growth for 2x Faster Language Model Pre-training
- Authors: Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang
- Abstract summary: We focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one.
In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work.
We propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators.
- Score: 18.276784451675603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerating large language model pre-training is a critical issue in current research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance. Code is publicly available at https://github.com/cofe-ai/MSG.
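The core idea behind a strictly function-preserving growth operator can be illustrated with a small sketch. The following is a minimal, hypothetical PyTorch example, not taken from the authors' released code: the module, method names, and growth procedure are illustrative assumptions. A feed-forward hidden dimension is widened, the new units keep an arbitrary random initialization, and a mask zeroes out their contribution, so the grown model computes exactly the same function as before growth; ramping the mask up during training is only indicated in a comment.

```python
# A minimal sketch of a mask-based, strictly function-preserving width-growth
# operator in the spirit of MSG. NOT the authors' implementation: module,
# names, and growth procedure are illustrative assumptions.
import torch
import torch.nn as nn


class GrowableFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Mask over hidden units: 1.0 for existing units, 0.0 for newly grown ones.
        self.register_buffer("mask", torch.ones(d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(x)) * self.mask  # masked-out units contribute nothing
        return self.fc2(h)

    @torch.no_grad()
    def grow_hidden(self, new_d_hidden: int) -> None:
        old = self.fc1.out_features
        fc1 = nn.Linear(self.fc1.in_features, new_d_hidden)
        fc2 = nn.Linear(new_d_hidden, self.fc2.out_features)
        # Copy the old weights; the new rows/columns keep their (arbitrary)
        # random initialization -- the zero mask makes the operator strictly
        # function-preserving regardless of how they are initialized.
        fc1.weight[:old] = self.fc1.weight
        fc1.bias[:old] = self.fc1.bias
        fc2.weight[:, :old] = self.fc2.weight
        fc2.bias.copy_(self.fc2.bias)
        self.fc1, self.fc2 = fc1, fc2
        new_mask = torch.zeros(new_d_hidden, device=self.mask.device)
        new_mask[:old] = self.mask
        self.register_buffer("mask", new_mask)
        # During subsequent training, the zero entries of `mask` would be
        # gradually ramped up to 1.0 so the new capacity starts to contribute.


# Sanity check: the output is unchanged immediately after growth.
x = torch.randn(4, 16)
ffn = GrowableFFN(d_model=16, d_hidden=32)
y_before = ffn(x)
ffn.grow_hidden(64)
assert torch.allclose(y_before, ffn(x), atol=1e-6)
```

The sketch covers only one growth dimension (feed-forward width); per the abstract, MSG's schedules involve all possible dimensions, and the actual operators and schedules are in the paper and the repository at https://github.com/cofe-ai/MSG.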
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation [9.920563105290894]
Cogito is a neurobiologically inspired multi-agent framework that enhances problem-solving capabilities in code generation tasks at lower cost.
Cogito accumulates knowledge and cognitive skills at each stage, ultimately forming a Super Role, an all-capable agent that performs the code generation task.
arXiv Detail & Related papers (2025-01-30T01:41:44Z)
- Overcoming Growth-Induced Forgetting in Task-Agnostic Continual Learning [9.91929539637026]
In continual learning (CL), model growth enhances adaptability to new data and improves knowledge retention for more tasks.
However, improper model growth can severely degrade previously learned knowledge, especially in task-agnostic CL that uses the entire grown model for inference.
This paper presents a novel SparseGrow approach to overcome growth-induced forgetting (GIFt) while enhancing adaptability to new data.
arXiv Detail & Related papers (2024-08-20T06:05:52Z)
- Landscape-Aware Growing: The Power of a Little LAG [49.897766925371485]
We study the question of how to select the best growing strategy from a given pool of growing strategies.
We present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)".
arXiv Detail & Related papers (2024-06-04T16:38:57Z)
- TaE: Task-aware Expandable Representation for Long Tail Class Incremental Learning [42.630413950957795]
We introduce a novel Task-aware Expandable (TaE) framework to learn diverse representations from each incremental task.
TaE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-02-08T16:37:04Z)
- GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training [78.63699436330165]
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks.
Online data are growing constantly, highlighting the importance of a pre-trained model's ability to learn from continuously growing data.
We propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input.
arXiv Detail & Related papers (2023-08-22T10:07:49Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to 3x -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
- On the Transformer Growth for Progressive BERT Training [37.57617077192438]
We find that, similar to network architecture search, Transformer growth also favors compound scaling.
In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively.
arXiv Detail & Related papers (2020-10-23T17:44:59Z)