M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining
- URL: http://arxiv.org/abs/2110.03888v1
- Date: Fri, 8 Oct 2021 04:24:51 GMT
- Title: M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining
- Authors: Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia,
Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, Hongxia Yang
- Abstract summary: Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprints.
- Score: 55.16088793437898
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent rapid developments in deep learning algorithms, distributed
training, and hardware design for large models have enabled training
extreme-scale models such as GPT-3 and Switch Transformer, which possess
hundreds of billions or even trillions of parameters. However, under limited
resources, extreme-scale model training, which requires enormous amounts of
compute and memory, suffers from frustratingly low efficiency in model
convergence. In this paper, we propose a simple training strategy called
"Pseudo-to-Real" for large models with high memory footprints. Pseudo-to-Real
is compatible with large models built from sequential layers. We demonstrate
pretraining an unprecedented 10-trillion-parameter model, an order of magnitude
larger than the state of the art, on only 512 GPUs within 10 days. Besides
demonstrating the application of Pseudo-to-Real, we also provide a technique,
Granular CPU offloading, to manage CPU memory for training large models while
maintaining high GPU utilization. Fast training of extreme-scale models on a
modest amount of resources brings a much smaller carbon footprint and
contributes to greener AI.
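The abstract describes Pseudo-to-Real only at a high level: a training strategy for models built from sequential layers. One plausible reading of the sharing-delinking paradigm in the title is a two-stage schedule: first train a "pseudo" model whose layers all share one set of weights, then "delink" (untie) those weights so every layer receives its own copy, and continue pretraining the resulting "real" model. The PyTorch sketch below illustrates that reading; `Block`, `build_pseudo`, and `delink` are illustrative names, not code from the paper.

```python
# Minimal sketch of a "share then delink" schedule (illustrative assumptions,
# not the authors' implementation): the pseudo stage reuses one block's weights
# at every depth; delinking copies them so each depth trains independently.
import copy
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for one sequential layer (e.g. a Transformer block)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))

def build_pseudo(dim, num_layers):
    """Pseudo stage: a single set of block weights is reused at every depth."""
    shared = Block(dim)
    return nn.ModuleList([shared] * num_layers)   # the same module object, repeated

def delink(pseudo_layers):
    """Delinking: give every depth its own independent copy of the shared weights."""
    return nn.ModuleList([copy.deepcopy(layer) for layer in pseudo_layers])

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

if __name__ == "__main__":
    dim, depth = 64, 8
    pseudo = build_pseudo(dim, depth)   # few unique parameters: cheap early training
    x = torch.randn(2, 16, dim)
    run(pseudo, x)                      # ... pseudo-stage pretraining would go here ...
    real = delink(pseudo)               # untie the weights to initialize the real model
    run(real, x)                        # ... continue pretraining the full-size model ...
```

The pseudo stage holds roughly 1/depth as many unique parameters, which is what makes the early phase cheap; after delinking, the copies simply serve as an initialization for full-scale pretraining.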
Related papers
- Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models [40.41898661688188]
This paper introduces Superpipeline, a framework designed to optimize the execution of large AI models on constrained hardware.
Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds.
arXiv Detail & Related papers (2024-10-11T13:17:05Z)
- Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches [64.42735183056062]
Large language models (LLMs) have transitioned from specialized models to versatile foundation models.
LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment.
arXiv Detail & Related papers (2024-08-20T09:42:17Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone (see the adapter sketch after this list).
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present 'model spilling', a technique aimed at models such as Transformers and CNNs that moves groups of layers between DRAM and GPU memory (see the spilling sketch after this list).
We then present a set of novel techniques that leverage spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
arXiv Detail & Related papers (2021-10-16T18:13:57Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training (see the masking sketch after this list).
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
GPU memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
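The visual-adaptation entry above trains a lightweight network on features from a frozen backbone, never backpropagating through the backbone itself. The sketch below shows that general pattern in PyTorch; the toy backbone and the two-layer adapter head are our illustrative assumptions, not the paper's design.

```python
# Sketch of adapting a frozen backbone without backpropagating through it.
# The "backbone" is a toy CNN standing in for a frozen, pretrained model; the
# small adapter head is the only thing that receives gradients.
import torch
import torch.nn as nn

backbone = nn.Sequential(              # stand-in for a pretrained visual backbone
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # -> (batch, 64) features
)
for p in backbone.parameters():
    p.requires_grad = False            # frozen: its weights never change
backbone.eval()

adapter = nn.Sequential(               # lightweight network that is actually trained
    nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10),
)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))

with torch.no_grad():                  # backbone features: no graph is built here
    feats = backbone(images)

loss = criterion(adapter(feats), labels)
loss.backward()                        # gradients reach only the adapter's parameters
optimizer.step()
```

Because no graph is built through the backbone, its features can be computed once and cached, which is where the time and memory savings come from.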
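Both the Granular CPU offloading mentioned in the M6-10T abstract and Hydra's 'model spilling' rest on the same primitive: parameters live in CPU DRAM and groups of layers are moved onto the GPU only while they are in use. The sketch below shows that primitive for an inference-style forward pass; real systems overlap transfers with compute, handle the backward pass and optimizer state, and choose group sizes carefully, none of which is modeled here.

```python
# Generic layer-spilling sketch (our simplification, not the authors' code):
# each group of layers is fetched into GPU memory for its part of the forward
# pass and then spilled back to CPU DRAM to free GPU memory.
import torch
import torch.nn as nn

def spilled_forward(layer_groups, x, device="cuda"):
    x = x.to(device)
    for group in layer_groups:
        group.to(device)               # fetch this group into GPU memory
        with torch.no_grad():          # forward-only; training needs re-fetching in backward
            x = group(x)
        group.to("cpu")                # spill it back to DRAM
    return x

if __name__ == "__main__":
    dim, layers_per_group, num_groups = 1024, 4, 6
    groups = [
        nn.Sequential(*[nn.Linear(dim, dim) for _ in range(layers_per_group)])
        for _ in range(num_groups)
    ]  # all parameters start on CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    out = spilled_forward(groups, torch.randn(8, dim), device)
    print(out.shape)
```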
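The Top-KAST entry describes training at a constant sparsity level. The core ingredient is magnitude-based top-k masking of a weight tensor; the sketch below shows only that simplified ingredient applied to a dense tensor, whereas Top-KAST itself maintains separate forward and backward sparsity sets and never materializes dense weights.

```python
# Simplified top-k magnitude masking for "always sparse" training (a sketch of
# the core idea only, not the full Top-KAST algorithm).
import torch
import torch.nn as nn

def topk_mask(weight: torch.Tensor, density: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the top `density` fraction of weights by magnitude."""
    k = max(1, int(density * weight.numel()))
    # k-th largest magnitude = (numel - k + 1)-th smallest
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).to(weight.dtype)

layer = nn.Linear(256, 256)
density = 0.1                          # keep 10% of weights

for step in range(3):                  # toy training loop
    mask = topk_mask(layer.weight.data, density)
    layer.weight.data.mul_(mask)       # enforce constant sparsity before the forward pass
    x = torch.randn(32, 256)
    loss = layer(x).pow(2).mean()      # toy loss
    loss.backward()
    with torch.no_grad():
        layer.weight -= 1e-3 * layer.weight.grad * mask   # update only surviving weights
    layer.zero_grad()
```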
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.