Related papers: HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

URL: http://arxiv.org/abs/2404.10630v1
Date: Tue, 16 Apr 2024 15:02:46 GMT
Title: HLAT: High-quality Large Language Model Pre-trained on AWS Trainium
Authors: Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan
Abstract summary: Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. AWS Trainium is the second-generation machine learning accelerator that has been purposely built for training large deep learning models.
Score: 21.183733616898365
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of the expensive conventional accelerators (such as GPUs), which creates a need for alternative specialized accelerators that are scalable and cost-efficient. AWS Trainium is the second-generation machine learning accelerator that has been purposely built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7 billion parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open source baseline models including LLaMA and OpenLLaMA, which have been trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share best practices for using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium, to achieve efficient training. Our work demonstrates that AWS Trainium powered by the NDTL is able to successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.

Related papers

Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better [24.03797089794804]
We propose a Late-to-Early Training (LET) paradigm that enables Large Language Models to learn later knowledge in earlier steps and earlier layers. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. Our method achieves up to $1.6\times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training.
arXiv Detail & Related papers (2026-02-05T07:19:34Z)
NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium [4.7520621855466425]
We design high-performance matmul, a critical compute kernel, for LLM inference on Trainium. We show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium.
arXiv Detail & Related papers (2025-10-29T21:22:08Z)
An Efficient Training Algorithm for Models with Block-wise Sparsity [6.882042556551613]
We propose an efficient training algorithm to decrease both computation and memory costs during training and inference. Our algorithms can decrease the computation and memory costs significantly without a performance drop compared to baselines.
arXiv Detail & Related papers (2025-03-27T19:14:27Z)
AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs). During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch. Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
Instruction Pre-Training: Language Models are Supervised Multitask Learners [115.95022434390181]
In this paper, we propose a framework that augments massive raw corpora with instruction-response pairs to pre-train language models (LMs). In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training.
arXiv Detail & Related papers (2024-06-20T16:55:33Z)
Sparsity-Accelerated Training for Large Language Models [20.86225596276327]
Large language models (LLMs) have demonstrated proficiency across various natural language processing (NLP) tasks. LLMs often require additional training, such as continual pre-training and supervised fine-tuning. This paper proposes leveraging sparsity in pre-trained LLMs to expedite this training process.
arXiv Detail & Related papers (2024-06-03T14:56:09Z)
Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures. We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models [8.055533378391814]
We show that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively. We propose Horizontally Fused Training Array (HFTA) to help DL researchers and practitioners effectively and easily improve the hardware utilization of their novel DL training workloads. HFTA demonstrates strong effectiveness in squeezing out hardware utilization and achieves up to $15.1\times$ higher training throughput vs. the standard practice of running each job on a separate accelerator.
arXiv Detail & Related papers (2021-02-03T23:56:55Z)
EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training. EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
Multi-node Bert-pretraining: Cost-efficient Approach [6.5998084177955425]
Large scale Transformer-based language models have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks. With the advent of large-scale unsupervised datasets, training time is further extended due to the increased amount of data samples within a single training epoch. We show that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an academic setting.
arXiv Detail & Related papers (2020-08-01T05:49:20Z)
