HLAT: High-quality Large Language Model Pre-trained on AWS Trainium
- URL: http://arxiv.org/abs/2404.10630v1
- Date: Tue, 16 Apr 2024 15:02:46 GMT
- Title: HLAT: High-quality Large Language Model Pre-trained on AWS Trainium
- Authors: Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan
- Abstract summary: Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens.
This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training.
AWS Trainium is the second-generation machine learning accelerator that has been purposely built for training large deep learning models.
- Score: 21.183733616898365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of expensive conventional accelerators (such as GPUs), which creates a need for alternative specialized accelerators that are scalable and cost-efficient. AWS Trainium is the second-generation machine learning accelerator that has been purposely built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7-billion-parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open-source baseline models including LLaMA and OpenLLaMA, which have been trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share best practices for using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium, to achieve efficient training. Our work demonstrates that AWS Trainium powered by the NDTL is able to successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation [32.01836613286288]
This work presents a Fully BInarized Large Language Model (FBI-LLM).
It demonstrates for the first time how to train a large-scale binary language model from scratch.
arXiv Detail & Related papers (2024-07-09T17:59:48Z)
- Sparsity-Accelerated Training for Large Language Models [20.86225596276327]
Large language models (LLMs) have demonstrated proficiency across various natural language processing (NLP) tasks.
LLMs often require additional training, such as continual pre-training and supervised fine-tuning.
This paper proposes leveraging sparsity in pre-trained LLMs to expedite this training process.
arXiv Detail & Related papers (2024-06-03T14:56:09Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
- Multi-node Bert-pretraining: Cost-efficient Approach [6.5998084177955425]
Large scale Transformer-based language models have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks.
With the advent of large-scale unsupervised datasets, training time is further extended due to the increased amount of data samples within a single training epoch.
We show that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an academic setting.
arXiv Detail & Related papers (2020-08-01T05:49:20Z)