Evolving Subnetwork Training for Large Language Models
- URL: http://arxiv.org/abs/2406.06962v1
- Date: Tue, 11 Jun 2024 05:44:56 GMT
- Title: Evolving Subnetwork Training for Large Language Models
- Authors: Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, Kai Yu
- Abstract summary: We propose a novel training paradigm: Evolving Subnetwork Training (EST).
EST samples subnetworks from the layers of the large language model and from commonly used modules within each layer.
We apply EST to train the GPT2 and TinyLlama models, saving 26.7% of training FLOPs for GPT2 and 25.0% for TinyLlama without an increase in loss on the pre-training dataset.
- Score: 19.54861230097017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have ushered in a new era of artificial intelligence research. However, their substantial training costs hinder further development and widespread adoption. In this paper, inspired by the redundancy in the parameters of large language models, we propose a novel training paradigm: Evolving Subnetwork Training (EST). EST samples subnetworks from the layers of the large language model and from commonly used modules within each layer, Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP). By gradually increasing the size of the subnetworks during the training process, EST can save the cost of training. We apply EST to train the GPT2 and TinyLlama models, resulting in a 26.7% FLOPs saving for GPT2 and 25.0% for TinyLlama without an increase in loss on the pre-training dataset. Moreover, EST leads to performance improvements in downstream tasks, indicating that it benefits generalization. Additionally, we provide intuitive theoretical studies based on training dynamics and Dropout theory to ensure the feasibility of EST. Our code is available at https://github.com/OpenDFM/EST.
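The abstract does not spell out the sampling mechanics, but the general idea can be sketched in a few lines of PyTorch. Everything below (the `ToyBlock` module, the linear growth schedule, the keep-probabilities) is an illustrative assumption, not the authors' implementation: each step activates a random subset of MHA and MLP modules, and the fraction of active modules grows toward the full network over training.

```python
# Minimal sketch of subnetwork sampling with a growing schedule,
# in the spirit of EST. Names and the schedule are hypothetical.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, use_attn=True, use_mlp=True):
        # Skipped modules fall back to the residual path, so any
        # sampled subnetwork is still a valid transformer.
        if use_attn:
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        if use_mlp:
            x = x + self.mlp(self.ln2(x))
        return x

def active_ratio(step, total_steps, start=0.5):
    """Grow the probability of keeping each module from `start` to 1.0."""
    return min(1.0, start + (1.0 - start) * step / total_steps)

torch.manual_seed(0)
blocks = nn.ModuleList(ToyBlock() for _ in range(8))
x, total_steps = torch.randn(2, 16, 64), 1000
for step in (0, 500, 1000):
    p = active_ratio(step, total_steps)
    h = x
    for blk in blocks:
        # Independently sample the MHA and MLP of each layer.
        h = blk(h, use_attn=bool(torch.rand(()) < p),
                   use_mlp=bool(torch.rand(()) < p))
    print(f"step {step}: module keep-probability = {p:.2f}")
```

Because a skipped module reduces to the residual identity, every sampled subnetwork is itself a runnable transformer, which is what makes the early, cheaper steps contribute useful training signal.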
Related papers
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
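As a rough illustration of the learn/focus/review loop (the block granularity, EMA tracking, and half-focus/half-review split are assumptions, not LFR's published recipe):

```python
# Hedged sketch: track per-block loss and revisit high-loss blocks.
import random

class LossTracker:
    def __init__(self, num_blocks, momentum=0.9):
        self.ema = [float("inf")] * num_blocks  # unseen blocks rank first
        self.momentum = momentum

    def update(self, block_id, loss):
        old = self.ema[block_id]
        if old == float("inf"):
            self.ema[block_id] = loss
        else:
            self.ema[block_id] = self.momentum * old + (1 - self.momentum) * loss

    def sample(self, k):
        # "Focus" on the highest-loss blocks, mix in random "review"
        # blocks (possibly overlapping, for simplicity).
        ranked = sorted(range(len(self.ema)), key=lambda i: -self.ema[i])
        focus = ranked[: k // 2]
        review = random.sample(range(len(self.ema)), k - len(focus))
        return focus + review

tracker = LossTracker(num_blocks=100)
for step in range(3):
    for b in tracker.sample(k=8):
        loss = random.random()   # stand-in for the real training loss
        tracker.update(b, loss)
```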
- Test-Time Training on Graphs with Large Language Models (LLMs) [68.375487369596]
Test-Time Training (TTT) has been proposed as a promising approach to train Graph Neural Networks (GNNs).
Inspired by the great annotation ability of Large Language Models (LLMs) on Text-Attributed Graphs (TAGs), we propose to enhance the test-time training on graphs with LLMs as annotators.
A two-stage training strategy is designed to tailor the test-time model with the limited and noisy labels.
arXiv Detail & Related papers (2024-04-21T08:20:02Z)
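A heavily hedged sketch of what such a two-stage strategy could look like, with a plain linear model standing in for the GNN and hypothetical confidence thresholds:

```python
# Stage 1: fit the nodes with confident LLM annotations.
# Stage 2: self-train on the rest using the model's own predictions.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 3)                 # stand-in for a GNN
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

feats = torch.randn(32, 16)              # test-time node features
llm_labels = torch.randint(0, 3, (32,))  # noisy LLM annotations
llm_conf = torch.rand(32)                # LLM's self-reported confidence

mask = llm_conf > 0.8                    # stage 1: confident labels only
for _ in range(10):
    opt.zero_grad()
    loss_fn(model(feats[mask]), llm_labels[mask]).backward()
    opt.step()

with torch.no_grad():                    # stage 2: confident pseudo-labels
    probs = model(feats).softmax(-1)
conf, pseudo = probs.max(-1)
mask2 = (conf > 0.9) & ~mask
if mask2.any():
    opt.zero_grad()
    loss_fn(model(feats[mask2]), pseudo[mask2]).backward()
    opt.step()
```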
- Inheritune: Training Smaller Yet More Attentive Language Models [61.363259848264725]
Inheritune is a simple yet effective training recipe for developing smaller, high-performing language models.
We demonstrate that Inheritune enables the training of various sizes of GPT-2 models on datasets like OpenWebText-9B and FineWeb_edu.
arXiv Detail & Related papers (2024-04-12T17:53:34Z)
- INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to ~99% of the performance of the fully-trained models.
arXiv Detail & Related papers (2023-05-11T09:24:41Z)
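Submodular subset selection is commonly done with a greedy algorithm; the facility-location sketch below is one standard instantiation and is not claimed to be INGENIOUS' exact objective or feature space:

```python
# Greedy facility-location selection of representative samples.
import numpy as np

def greedy_facility_location(sim, k):
    """Greedily pick k points maximizing sum_i max_{j in S} sim[i, j]."""
    n = sim.shape[0]
    selected, best_cover = [], np.zeros(n)
    for _ in range(k):
        # Marginal gain of adding each remaining candidate.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))          # stand-in sample embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T                         # cosine similarities
print(greedy_facility_location(sim, k=10))
```

Greedy maximization enjoys the classic (1 - 1/e) approximation guarantee for monotone submodular objectives, which is why it is the default choice for this kind of subset selection.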
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
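A minimal sketch of the sparse-pre-training/dense-fine-tuning pattern, assuming a static random mask (the paper's mask selection and schedule may differ):

```python
# Freeze a random 75%-sparse mask during "pre-training", then drop it.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 512)
sparsity = 0.75
mask = (torch.rand_like(layer.weight) > sparsity).float()  # keep ~25%

with torch.no_grad():
    layer.weight.mul_(mask)                    # zero out pruned weights
hook = layer.weight.register_hook(lambda g: g * mask)  # freeze them at zero

opt = torch.optim.SGD(layer.parameters(), lr=0.1)
for _ in range(5):                             # "sparse pre-training" steps
    x = torch.randn(8, 512)
    loss = layer(x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

hook.remove()                                  # "dense fine-tuning": all weights train
print(f"nonzero fraction: {(layer.weight != 0).float().mean():.2f}")
```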
- Learning to Weight Samples for Dynamic Early-exiting Networks [35.03752825893429]
Early exiting is an effective paradigm for improving the inference efficiency of deep networks.
Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit.
We show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.
arXiv Detail & Related papers (2022-09-17T10:46:32Z)
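A small sketch of per-sample, per-exit loss weighting; the weight network's architecture and its softmax parameterization are illustrative assumptions, not the paper's exact design:

```python
# Weight each sample's loss at every exit with a small prediction net.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.ModuleList(nn.Linear(32, 32) for _ in range(3))
exits = nn.ModuleList(nn.Linear(32, 10) for _ in range(3))
weight_net = nn.Sequential(nn.Linear(32, 3), nn.Softmax(dim=-1))

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))
ce = nn.CrossEntropyLoss(reduction="none")

w = weight_net(x)                       # one weight per sample per exit
h, total = x, 0.0
for i, (blk, head) in enumerate(zip(backbone, exits)):
    h = torch.relu(blk(h))
    per_sample = ce(head(h), y)         # loss of every sample at exit i
    total = total + (w[:, i] * per_sample).mean()
total.backward()
```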
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
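The reuse can be illustrated with a Net2Net-style function-preserving width expansion; bert2BERT's actual operators are more involved, so treat this as a simplified sketch:

```python
# Expand a hidden layer from 16 to 32 units while preserving the function:
# duplicate incoming rows, split outgoing columns by duplication count.
import torch
import torch.nn as nn

torch.manual_seed(0)
small1, small2 = nn.Linear(8, 16), nn.Linear(16, 4)
big1, big2 = nn.Linear(8, 32), nn.Linear(32, 4)

idx = torch.arange(32) % 16                    # map new units to source units
counts = torch.bincount(idx, minlength=16).float()

with torch.no_grad():
    big1.weight.copy_(small1.weight[idx])      # duplicate incoming weights
    big1.bias.copy_(small1.bias[idx])
    big2.weight.copy_(small2.weight[:, idx] / counts[idx])  # split outgoing
    big2.bias.copy_(small2.bias)

x = torch.randn(5, 8)
print(torch.allclose(small2(small1(x)), big2(big1(x)), atol=1e-6))  # True
```

Because the expanded model computes exactly the same function as the small one at initialization, training continues from the small model's loss rather than from scratch, which is where the reported compute savings come from.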
- Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining [51.19885385587916]
We conduct studies on semi-supervised learning in the task of text classification under the context of large-scale LM pretraining.
Our work marks an initial step in understanding the behavior of semi-supervised learning models under the context of large-scale pretraining.
arXiv Detail & Related papers (2020-11-17T13:39:05Z)
- The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z)
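The lottery-ticket procedure itself is easy to sketch: train, prune by weight magnitude, rewind, and retrain the surviving subnetwork. The one-shot pruning and the rewind-to-initialization below are simplified assumptions (the paper uses more careful schedules):

```python
# Find a candidate "winning ticket" by magnitude pruning and rewinding.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(64, 64)
init_state = copy.deepcopy(model.state_dict())   # weights to rewind to

# (Train the model here; omitted for brevity.)

# Keep the top 20% of weights by magnitude.
w = model.weight.detach().abs()
threshold = w.flatten().kthvalue(int(0.8 * w.numel())).values
mask = (w > threshold).float()

# Rewind to the early weights and apply the mask: this sparse network
# is the candidate subnetwork to retrain in isolation.
model.load_state_dict(init_state)
with torch.no_grad():
    model.weight.mul_(mask)
print(f"remaining weights: {(model.weight != 0).float().mean():.0%}")
```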