PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based
Memory Management
- URL: http://arxiv.org/abs/2108.05818v1
- Date: Thu, 12 Aug 2021 15:58:12 GMT
- Title: PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based
Memory Management
- Authors: Jiarui Fang, Yang Yu, Shenggui Li, Yang You, Jie Zhou
- Abstract summary: The pre-trained model (PTM) is revolutionizing artificial intelligence (AI) technology.
A PTM learns general language features from vast amounts of text and is then fine-tuned using a task-specific dataset.
PatrickStar reduces memory requirements of computing platforms by using heterogeneous memory space.
- Score: 19.341284825473558
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The pre-trained model (PTM) is revolutionizing artificial intelligence
(AI) technology. A PTM learns general language features from vast amounts of text
and is then fine-tuned using a task-specific dataset. Unfortunately, PTM training,
especially fine-tuning, requires prohibitively expensive computing devices and is
still accessible to only a small fraction of the AI community. By enabling PTM
training on low-end devices, PatrickStar makes PTMs accessible to everyone.
PatrickStar reduces memory requirements of computing platforms by using the
CPU-GPU heterogeneous memory space to store model data, consisting of
parameters, gradients, and optimizer states. We observe that the GPU memory
available for model data changes regularly, in a tide-like pattern, decreasing
and increasing iteratively. However, the existing heterogeneous training works
do not take advantage of this pattern. Instead, they statically partition the
model data among CPU and GPU, leading to both memory waste and memory abuse. In
contrast, PatrickStar manages model data in chunks, which are dynamically
distributed in heterogeneous memory spaces. Chunks consist of stateful tensors
which run as finite state machines during training. Guided by the runtime
memory statistics collected in a warm-up iteration, chunks are orchestrated
efficiently in heterogeneous memory, yielding a lower CPU-GPU data transmission
volume. In symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to
multiple GPUs using data parallelism, with the lowest communication bandwidth
requirements and more efficient bandwidth utilization. Experimental results show
that PatrickStar trains a 12-billion-parameter GPT model, 2x larger than the
state-of-the-art work, on a node with 8 V100 GPUs and 240 GB of CPU memory, and
is also more efficient at the same model size.
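To make the chunk-based scheme concrete, below is a minimal, illustrative sketch of the idea the abstract describes: chunks hold stateful tensors that run a small finite state machine, and a manager places each chunk on GPU or CPU using the free-GPU-memory curve recorded during a warm-up iteration. This is not PatrickStar's actual API; the names Chunk, TensorState, and ChunkManager, as well as the scheduling policy, are hypothetical simplifications.

```python
# Illustrative sketch of chunk-based heterogeneous memory management.
# Hypothetical names; assumes a CUDA device is available.
from enum import Enum, auto
import torch


class TensorState(Enum):
    """Finite-state-machine states a tensor passes through during training."""
    FREE = auto()     # payload not needed right now
    COMPUTE = auto()  # needed on GPU for forward/backward compute
    HOLD = auto()     # produced or consumed soon; may live on CPU or GPU


class Chunk:
    """A fixed-size memory block holding several tensors of the same kind
    (parameters, gradients, or optimizer states)."""

    def __init__(self, capacity_elems: int, dtype=torch.float16):
        self.payload = torch.empty(capacity_elems, dtype=dtype, device="cpu")
        self.states: dict[str, TensorState] = {}  # tensor name -> FSM state

    def set_state(self, name: str, state: TensorState) -> None:
        self.states[name] = state

    def needed_on_gpu(self) -> bool:
        return any(s is TensorState.COMPUTE for s in self.states.values())

    def size_bytes(self) -> int:
        return self.payload.numel() * self.payload.element_size()

    def move_to(self, device: str) -> None:
        if self.payload.device.type != device:
            self.payload = self.payload.to(device, non_blocking=True)


class ChunkManager:
    """Places chunks on GPU or CPU guided by the per-moment free-memory
    statistics collected in a warm-up iteration."""

    def __init__(self, chunks: list[Chunk], warmup_free_gpu_bytes: list[int]):
        self.chunks = chunks
        # Tide-like curve of GPU memory available for model data, one entry
        # per compute "moment" recorded during warm-up.
        self.warmup_free_gpu_bytes = warmup_free_gpu_bytes

    def schedule(self, moment: int) -> None:
        budget = self.warmup_free_gpu_bytes[moment]
        used = 0
        for chunk in self.chunks:
            size = chunk.size_bytes()
            if chunk.needed_on_gpu() and used + size <= budget:
                chunk.move_to("cuda")  # keep compute-bound chunks on GPU
                used += size
            else:
                chunk.move_to("cpu")   # evict idle chunks to CPU memory
```

In an actual run, tensor state transitions during the forward, backward, and optimizer steps would drive schedule() at each moment, approximating the tide-like eviction and prefetch pattern described in the abstract.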
Related papers
- Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z)
- Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters [5.190794062263327]
Training large language models (LLMs) is known to be challenging because of the huge computational and memory capacity requirements.
We propose Pipette, an automatic fine-grained LLM training configurator for real-world clusters.
arXiv Detail & Related papers (2024-05-28T11:59:44Z)
- AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
- Elixir: Train a Large Language Model on a Small GPU Cluster [6.578131399847817]
Large language models have achieved great success due to their unprecedented size.
Elixir automates efficient large-model training based on pre-runtime model profiling.
Elixir significantly outperforms the current state-of-the-art baseline.
arXiv Detail & Related papers (2022-12-10T17:26:05Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [23.264897780201316]
Various deep Click-Through Rate (CTR) models are deployed in the commercial systems by industrial companies.
To achieve better performance, it is necessary to train deep CTR models on huge volumes of training data efficiently.
We propose the ScaleFreeCTR: a MixCache-based distributed training system for CTR models.
arXiv Detail & Related papers (2021-04-17T13:36:19Z)