Multi-node Bert-pretraining: Cost-efficient Approach
- URL: http://arxiv.org/abs/2008.00177v1
- Date: Sat, 1 Aug 2020 05:49:20 GMT
- Title: Multi-node Bert-pretraining: Cost-efficient Approach
- Authors: Jiahuang Lin, Xin Li, Gennady Pekhimenko
- Abstract summary: Large scale Transformer-based language models have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks.
With the advent of large-scale unsupervised datasets, training time is further extended due to the increased amount of data samples within a single training epoch.
We show that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an academic setting.
- Score: 6.5998084177955425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large scale Transformer-based language models such as BERT, GPT-2,
and XLNet have brought about exciting leaps in state-of-the-art results for
many Natural Language Processing (NLP) tasks. One of the common trends in these
recent models is a significant increase in model complexity, which introduces
both more weights and computation. Moreover, with the advent of large-scale
unsupervised datasets, training time is further extended due to the increased
amount of data samples within a single training epoch. As a result, to train
these models within a reasonable time, machine learning (ML) programmers often
require advanced hardware setups such as the premium GPU-enabled NVIDIA DGX
workstations or specialized accelerators such as Google's TPU Pods. Our work
addresses this limitation and demonstrates that the BERT model can be
pre-trained within 2 weeks on an academic-size cluster of widely available GPUs
through careful algorithmic and software optimizations. In this paper, we
present these optimizations and show how to improve single-device training
throughput, distribute the training workload over multiple nodes and GPUs, and
overcome the communication bottleneck introduced by the large data exchanges
over the network. We show that we are able to perform pre-training on BERT
within a reasonable time budget (12 days) in an academic setting, but with a
much less expensive and less aggressive hardware resource requirement than in
previously demonstrated industrial settings based on NVIDIA DGX machines or
Google's TPU Pods.
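To make the three optimization directions above concrete, the sketch below shows how they commonly fit together in PyTorch: mixed precision raises single-device throughput, DistributedDataParallel spreads the workload across nodes and GPUs, and gradient accumulation with no_sync() reduces how often gradients are exchanged over the network. This is an illustrative sketch under stated assumptions (a torchrun-style launch and a model whose forward returns its loss), not the authors' actual training code or configuration.
```python
# A minimal sketch (not the paper's released code) of the kind of optimizations
# described above: mixed-precision training, gradient accumulation, and
# DistributedDataParallel with the all-reduce deferred to accumulation boundaries.
import contextlib
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def pretrain(model, loader, accum_steps=16, lr=1e-4):
    # One process per GPU, launched e.g. with torchrun (which sets LOCAL_RANK
    # and the rendezvous variables read by init_process_group).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

    model.train()
    for step, batch in enumerate(loader):
        batch = {k: v.cuda(local_rank, non_blocking=True) for k, v in batch.items()}

        # Communication optimization: skip the gradient all-reduce on
        # intermediate accumulation steps and synchronize only when the
        # optimizer actually steps.
        boundary = (step + 1) % accum_steps == 0
        ctx = contextlib.nullcontext() if boundary else model.no_sync()
        with ctx:
            with torch.cuda.amp.autocast():
                # Assumes a model whose forward returns an object with a
                # .loss field (e.g. a BERT pre-training head).
                loss = model(**batch).loss / accum_steps
            scaler.scale(loss).backward()

        if boundary:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```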
Related papers
- Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms [4.959530958049395]
We develop a pipeline to characterize and predict the training performance of modern machine learning (ML) workloads on compute systems.
Our pipeline generalizes to other types of ML workloads, such as Transformer-based NLP models.
It is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration.
arXiv Detail & Related papers (2024-04-19T07:20:33Z) - HLAT: High-quality Large Language Model Pre-trained on AWS Trainium [21.183733616898365]
For large language models (LLMs) to perform well on downstream tasks, pre-training over trillions of tokens is required.
This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training.
AWS Trainium is the second-generation machine learning accelerator purpose-built for training large deep learning models.
arXiv Detail & Related papers (2024-04-16T15:02:46Z) - Harnessing Manycore Processors with Distributed Memory for Accelerated
Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z) - PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprint requirements.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - Horizontally Fused Training Array: An Effective Hardware Utilization
Squeezer for Training Novel Deep Learning Models [8.055533378391814]
We show that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively.
We propose Horizontally Fused Training Array (HFTA) to help DL researchers and practitioners effectively and easily improve the hardware utilization of their novel DL training workloads.
HFTA demonstrates strong effectiveness in squeezing out hardware utilization and achieves up to 15.1x higher training throughput vs. the standard practice of running each job on a separate accelerator.
arXiv Detail & Related papers (2021-02-03T23:56:55Z) - EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z) - A Tensor Compiler for Unified Machine Learning Prediction Serving [8.362773007171118]
Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure.
Model scoring is a primary contributor to infrastructure complexity and cost as models are trained once but used many times.
We propose HUMMINGBIRD, a novel approach to model scoring that compiles featurization operators and traditional ML models into a small set of tensor operations; a hedged usage sketch appears after this list.
arXiv Detail & Related papers (2020-10-09T21:02:47Z) - Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best BERT model structure for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
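For the HUMMINGBIRD entry above, the following is a hedged usage sketch of the open-sourced hummingbird library as documented in its README: a trained scikit-learn model is compiled into tensor operations and scored on a PyTorch backend. It illustrates the idea rather than reproducing the paper's implementation, and the specific model and data shapes are arbitrary.
```python
# Illustrative HUMMINGBIRD usage (assumes the open-source hummingbird-ml
# package): compile a trained traditional-ML model into tensor operations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

X = np.random.rand(1000, 28).astype(np.float32)
y = np.random.randint(2, size=1000)

clf = RandomForestClassifier(n_estimators=10, max_depth=8).fit(X, y)

hb_model = convert(clf, "pytorch")  # trees become a small set of tensor ops
preds = hb_model.predict(X)         # same scoring interface as scikit-learn
hb_model.to("cuda")                 # scoring can optionally move to a GPU
```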