M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining
- URL: http://arxiv.org/abs/2110.03888v1
- Date: Fri, 8 Oct 2021 04:24:51 GMT
- Title: M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining
- Authors: Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia,
Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, Hongxia Yang
- Abstract summary: Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprints.
- Score: 55.16088793437898
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent rapid developments in deep learning algorithms, distributed
training, and hardware design for large models have enabled training
extreme-scale models such as GPT-3 and Switch Transformer, which possess
hundreds of billions or even trillions of parameters. However, under limited
resources, extreme-scale model training, which requires enormous amounts of
compute and memory, suffers from frustratingly low efficiency in model
convergence. In this paper, we propose a simple training strategy called
"Pseudo-to-Real" for large models with high memory footprints. Pseudo-to-Real
is compatible with large models built from sequential layers. We demonstrate
pretraining an unprecedented 10-trillion-parameter model, an order of magnitude
larger than the state of the art, on only 512 GPUs within 10 days. Besides
demonstrating the application of Pseudo-to-Real, we also provide a technique,
Granular CPU offloading, to manage CPU memory for training large models while
maintaining high GPU utilization. Fast training of extreme-scale models on a
modest amount of resources brings a much smaller carbon footprint and
contributes to greener AI.
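The abstract describes Pseudo-to-Real only at a high level: a training strategy for models built from sequential layers. One plausible reading of the sharing-delinking paradigm in the title is a two-stage schedule: first train a "pseudo" model whose layers all share one set of weights, then "delink" (untie) those weights so every layer receives its own copy, and continue pretraining the resulting "real" model. The PyTorch sketch below illustrates that reading; `Block`, `build_pseudo`, and `delink` are illustrative names, not code from the paper.

```python
# Minimal sketch of a "share then delink" schedule (illustrative assumptions,
# not the authors' implementation): the pseudo stage reuses one block's weights
# at every depth; delinking copies them so each depth trains independently.
import copy
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for one sequential layer (e.g. a Transformer block)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))

def build_pseudo(dim, num_layers):
    """Pseudo stage: a single set of block weights is reused at every depth."""
    shared = Block(dim)
    return nn.ModuleList([shared] * num_layers)   # the same module object, repeated

def delink(pseudo_layers):
    """Delinking: give every depth its own independent copy of the shared weights."""
    return nn.ModuleList([copy.deepcopy(layer) for layer in pseudo_layers])

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

if __name__ == "__main__":
    dim, depth = 64, 8
    pseudo = build_pseudo(dim, depth)   # few unique parameters: cheap early training
    x = torch.randn(2, 16, dim)
    run(pseudo, x)                      # ... pseudo-stage pretraining would go here ...
    real = delink(pseudo)               # untie the weights to initialize the real model
    run(real, x)                        # ... continue pretraining the full-size model ...
```

The pseudo stage holds roughly 1/depth as many unique parameters, which is what makes the early phase cheap; after delinking, the copies simply serve as an initialization for full-scale pretraining.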
Related papers
- Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models [40.41898661688188]
This paper introduces Superpipeline, a framework designed to optimize the execution of large AI models on constrained hardware.
Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds.
arXiv Detail & Related papers (2024-10-11T13:17:05Z)
- Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches [64.42735183056062]
Large language models (LLMs) have transitioned from specialized models to versatile foundation models.
LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment.
arXiv Detail & Related papers (2024-08-20T09:42:17Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone (see the adapter sketch after this list).
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present 'model spilling', a technique aimed at models such as Transformers and CNNs that moves groups of layers between DRAM and GPU memory (see the spilling sketch after this list).
We then present a set of novel techniques that leverage spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
arXiv Detail & Related papers (2021-10-16T18:13:57Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training (see the masking sketch after this list).
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
GPU memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
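The visual-adaptation entry above trains a lightweight network on features from a frozen backbone, never backpropagating through the backbone itself. The sketch below shows that general pattern in PyTorch; the toy backbone and the two-layer adapter head are our illustrative assumptions, not the paper's design.

```python
# Sketch of adapting a frozen backbone without backpropagating through it.
# The "backbone" is a toy CNN standing in for a frozen, pretrained model; the
# small adapter head is the only thing that receives gradients.
import torch
import torch.nn as nn

backbone = nn.Sequential(              # stand-in for a pretrained visual backbone
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # -> (batch, 64) features
)
for p in backbone.parameters():
    p.requires_grad = False            # frozen: its weights never change
backbone.eval()

adapter = nn.Sequential(               # lightweight network that is actually trained
    nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10),
)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))

with torch.no_grad():                  # backbone features: no graph is built here
    feats = backbone(images)

loss = criterion(adapter(feats), labels)
loss.backward()                        # gradients reach only the adapter's parameters
optimizer.step()
```

Because no graph is built through the backbone, its features can be computed once and cached, which is where the time and memory savings come from.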
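Both the Granular CPU offloading mentioned in the M6-10T abstract and Hydra's 'model spilling' rest on the same primitive: parameters live in CPU DRAM and groups of layers are moved onto the GPU only while they are in use. The sketch below shows that primitive for an inference-style forward pass; real systems overlap transfers with compute, handle the backward pass and optimizer state, and choose group sizes carefully, none of which is modeled here.

```python
# Generic layer-spilling sketch (our simplification, not the authors' code):
# each group of layers is fetched into GPU memory for its part of the forward
# pass and then spilled back to CPU DRAM to free GPU memory.
import torch
import torch.nn as nn

def spilled_forward(layer_groups, x, device="cuda"):
    x = x.to(device)
    for group in layer_groups:
        group.to(device)               # fetch this group into GPU memory
        with torch.no_grad():          # forward-only; training needs re-fetching in backward
            x = group(x)
        group.to("cpu")                # spill it back to DRAM
    return x

if __name__ == "__main__":
    dim, layers_per_group, num_groups = 1024, 4, 6
    groups = [
        nn.Sequential(*[nn.Linear(dim, dim) for _ in range(layers_per_group)])
        for _ in range(num_groups)
    ]  # all parameters start on CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    out = spilled_forward(groups, torch.randn(8, dim), device)
    print(out.shape)
```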
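The Top-KAST entry describes training at a constant sparsity level. The core ingredient is magnitude-based top-k masking of a weight tensor; the sketch below shows only that simplified ingredient applied to a dense tensor, whereas Top-KAST itself maintains separate forward and backward sparsity sets and never materializes dense weights.

```python
# Simplified top-k magnitude masking for "always sparse" training (a sketch of
# the core idea only, not the full Top-KAST algorithm).
import torch
import torch.nn as nn

def topk_mask(weight: torch.Tensor, density: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the top `density` fraction of weights by magnitude."""
    k = max(1, int(density * weight.numel()))
    # k-th largest magnitude = (numel - k + 1)-th smallest
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).to(weight.dtype)

layer = nn.Linear(256, 256)
density = 0.1                          # keep 10% of weights

for step in range(3):                  # toy training loop
    mask = topk_mask(layer.weight.data, density)
    layer.weight.data.mul_(mask)       # enforce constant sparsity before the forward pass
    x = torch.randn(32, 256)
    loss = layer(x).pow(2).mean()      # toy loss
    loss.backward()
    with torch.no_grad():
        layer.weight -= 1e-3 * layer.weight.grad * mask   # update only surviving weights
    layer.zero_grad()
```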
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.