CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization
- URL: http://arxiv.org/abs/2405.14377v1
- Date: Thu, 23 May 2024 09:52:15 GMT
- Title: CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization
- Authors: Zi Yang, Samridhi Choudhary, Xinfeng Xie, Cao Gao, Siegfried Kunzmann, Zheng Zhang
- Abstract summary: Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models costs massive GPU and computing time.
CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation.
CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training.
- Score: 10.319009303849109
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models costs massive GPU resources and computing time. The high training cost has become affordable only to big tech companies, while also raising increasing concerns about its environmental impact. This paper presents CoMERA, a Computing- and Memory-Efficient training method via Rank-Adaptive tensor optimization. CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation, and improves training to deliver both a high compression ratio and excellent accuracy. Our optimized numerical computation (e.g., optimized tensorized embedding and tensor-vector contractions) and GPU implementation eliminate part of the run-time overhead of tensorized training on GPU. This leads to, for the first time, a $2-3\times$ speedup per training epoch compared with standard training. CoMERA also outperforms the recent GaLore in terms of both memory and computing efficiency. Specifically, CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training. With further HPC optimization, CoMERA may significantly reduce the training cost of large language models.
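To make the idea of tensor-compressed training concrete, below is a minimal sketch (not the authors' code) of a linear layer whose weight matrix is stored as small tensor-train cores rather than as a dense matrix, so only the cores are trained. The class name TTLinear, the mode/rank choices, and the fixed ranks are illustrative assumptions; CoMERA additionally adapts the ranks during training via its multi-objective formulation.

```python
import torch
import torch.nn as nn

class TTLinear(nn.Module):
    """Linear layer whose 256x256 weight is stored as two small tensor-train cores."""
    def __init__(self, in_modes=(16, 16), out_modes=(16, 16), rank=8):
        super().__init__()
        self.in_modes = in_modes
        ranks = (1, rank, 1)  # TT-ranks; a rank-adaptive method would adjust these during training
        self.cores = nn.ParameterList(
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k], out_modes[k], ranks[k + 1]))
            for k in range(len(in_modes))
        )

    def forward(self, x):
        # x: (batch, 256) -> reshape to (batch, 16, 16) and contract with the TT cores
        t = x.reshape(x.shape[0], *self.in_modes)
        t = torch.einsum('bij,aiok->bjok', t, self.cores[0])   # contract input mode 1
        t = torch.einsum('bjok,kjpl->bopl', t, self.cores[1])  # contract input mode 2
        return t.reshape(x.shape[0], -1)                       # (batch, 256)

y = TTLinear()(torch.randn(4, 256))   # ~4K trainable parameters instead of ~65K dense
```

With the shapes above, the layer holds roughly 4K parameters in place of the 65K of a dense 256x256 weight, which is where the memory saving in tensor-compressed training comes from.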
Related papers
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer states and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce memory footprint when training large models.
COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16.
COAT also achieves a 1.43x end-to-end training speedup compared to BF16.
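The sketch below conveys the general idea of keeping optimizer states in 8-bit form and dequantizing them only for the update step. COAT itself uses FP8 formats together with tricks such as dynamic range expansion; the scaled-int8 simulation and the helper names here are assumptions made to keep the example self-contained.

```python
import torch

def quantize8(t):
    """Simulate 8-bit storage: per-tensor scale plus int8 values."""
    scale = t.abs().max().clamp(min=1e-12) / 127.0
    return (t / scale).round().clamp(-127, 127).to(torch.int8), scale

def dequantize8(q, scale):
    return q.to(torch.float32) * scale

def adam_step_8bit(param, grad, m_q, m_s, v_q, v_s, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update whose moments live in compressed 8-bit form between steps."""
    m = b1 * dequantize8(m_q, m_s) + (1 - b1) * grad
    v = b2 * dequantize8(v_q, v_s) + (1 - b2) * grad * grad
    param -= lr * m / (v.sqrt() + eps)
    return quantize8(m) + quantize8(v)          # re-compress both moments for storage

p, g = torch.randn(256, 256), torch.randn(256, 256)
m_q, m_s = quantize8(torch.zeros_like(p))
v_q, v_s = quantize8(torch.zeros_like(p))
m_q, m_s, v_q, v_s = adam_step_8bit(p, g, m_q, m_s, v_q, v_s)
```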
arXiv Detail & Related papers (2024-10-25T05:59:30Z)
- Breaking MLPerf Training: A Case Study on Optimizing BERT [9.486916730173661]
We present novel approaches for fast large-scale training of the BERT model.
Load balancing is imperative in distributed BERT training since its training data are characterized by samples of widely varying lengths.
We propose two new ideas: (1) local presorting based on dataset stratification for load balancing, and (2) bucket-wise gradient clipping before allreduce.
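Below is a minimal sketch of the flavor of length-based local presorting for load balancing: each worker sorts its own shard by sample length, so batches need less padding and workers finish steps at similar times. The fixed batch size, fake token sequences, and batch-level shuffling are assumptions, not the paper's exact stratification scheme.

```python
import random

def local_presort_batches(samples, batch_size):
    """Sort the local shard by length, batch it, then shuffle whole batches."""
    order = sorted(range(len(samples)), key=lambda i: len(samples[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)   # randomness across batches, near-uniform lengths within each
    return batches

shard = [[0] * random.randint(5, 128) for _ in range(1000)]   # one worker's fake token sequences
batches = local_presort_batches(shard, batch_size=32)
```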
arXiv Detail & Related papers (2024-02-04T11:12:17Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
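The sketch below illustrates, under assumptions and not as AdaLomo's actual implementation, the fused "update during backward" idea behind LOMO-style optimizers: each parameter is updated as soon as its gradient is produced, so full-model gradients and optimizer states never have to coexist in memory. The simple running second-moment scaling here is an illustrative stand-in for AdaLomo's adaptive learning-rate estimator.

```python
import torch
import torch.nn as nn

def attach_fused_updates(model, lr=1e-3, eps=1e-8):
    state = {}                                       # one running statistic per parameter
    def make_hook(p):
        def hook(grad):
            v = state.setdefault(p, torch.zeros_like(p))
            v.mul_(0.999).addcmul_(grad, grad, value=0.001)   # running second moment
            with torch.no_grad():
                p.add_(grad / (v.sqrt() + eps), alpha=-lr)    # update immediately
            return torch.zeros_like(grad)            # only a zero grad is left to accumulate
        return hook
    for p in model.parameters():
        p.register_hook(make_hook(p))

model = nn.Linear(16, 4)
attach_fused_updates(model)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()                                      # parameters are updated inside backward
```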
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions [9.068569788978854]
Parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable parameters.
However, PETL methods in computer vision (CV) can be computationally expensive and incur large memory and time costs during training.
$\mathrm{E^3VA}$ can save up to 62.2% of training memory and 26.2% of training time on average.
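For context, here is a minimal sketch of generic bottleneck adapter tuning with a frozen backbone; it is not the $\mathrm{E^3VA}$ design itself, which adds further structure aimed specifically at cutting training memory and time beyond what plain adapters give. The bottleneck size and the toy backbone are assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))     # residual bottleneck

backbone = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
for p in backbone.parameters():
    p.requires_grad_(False)                               # freeze the pretrained backbone
adapter = Adapter(256)                                    # only ~8K parameters are trained
out = adapter(backbone(torch.randn(2, 10, 256)))
```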
arXiv Detail & Related papers (2023-06-16T09:54:07Z)
- RAF: Holistic Compilation for Deep Learning Model Training [17.956035630476173]
In this paper, we present RAF, a deep learning compiler for training.
Unlike existing DLCs, RAF accepts a forward model and generates the training graph in-house.
RAF is able to systematically consolidate graph optimizations for performance, memory and distributed training.
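As a loose illustration of deriving the training computation from only a forward definition, the sketch below lets plain autograd play the role of RAF's in-house training-graph generation; the toy loss, the SGD update, and the function names are assumptions, and a real compiler operates on a graph IR rather than on eager tensors.

```python
import torch

def forward(w, x, y):                          # the user supplies only the forward/loss computation
    return ((x @ w - y) ** 2).mean()

def training_step(w, x, y, lr=0.1):            # the "derived" training computation
    loss = forward(w, x, y)
    (grad,) = torch.autograd.grad(loss, w)     # backward pass obtained automatically from forward
    return (w.detach() - lr * grad).requires_grad_(True), loss.item()

w = torch.randn(8, 1, requires_grad=True)
for _ in range(3):
    w, loss = training_step(w, torch.randn(32, 8), torch.randn(32, 1))
```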
arXiv Detail & Related papers (2023-03-08T17:51:13Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to reduce the training-time memory cost by about 70% and accelerate training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
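The core re-parameterization identity that this line of work builds on can be shown in a few lines: parallel linear branches (here a toy block with a 3x3 and a 1x1 convolution, an assumption rather than OREPA's actual training-time block) fold into one 3x3 convolution by merging kernels and biases.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv3 = nn.Conv2d(8, 8, kernel_size=3, padding=1)
conv1 = nn.Conv2d(8, 8, kernel_size=1)

merged = nn.Conv2d(8, 8, kernel_size=3, padding=1)
with torch.no_grad():
    merged.weight.copy_(conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1]))  # 1x1 kernel centered in 3x3
    merged.bias.copy_(conv3.bias + conv1.bias)

x = torch.randn(2, 8, 16, 16)
branched = conv3(x) + conv1(x)       # multi-branch block used at training time
single = merged(x)                   # equivalent single convolution
print(torch.allclose(branched, single, atol=1e-5))   # True up to floating-point error
```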
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
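Below is a minimal sketch of one kind of growth operator in the spirit of staged training: new residual blocks are appended with zero-initialized output layers, so the grown model initially computes exactly what the previous stage did. The toy residual MLP and the zero-init scheme are assumptions, not the paper's exact width/depth operators for transformers.

```python
import torch
import torch.nn as nn

def make_block(dim):
    block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    nn.init.zeros_(block[2].weight)      # zero output => the residual block starts as identity
    nn.init.zeros_(block[2].bias)
    return block

class ResidualMLP(nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.dim = dim
        self.blocks = nn.ModuleList(make_block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)
        return x

    def grow(self, extra):               # growth operator: keep old blocks, append new ones
        self.blocks.extend(make_block(self.dim) for _ in range(extra))

small = ResidualMLP(64, depth=2)         # ... train stage 1 ...
small.grow(2)                            # stage 2 starts from exactly the stage-1 function
```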
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
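A minimal simulation of the progressive idea is sketched below: early rounds train and communicate only a shallow prefix of the model plus a small auxiliary head, and deeper blocks are activated in later rounds, which is what reduces client compute and two-way communication. The toy blocks, the growth schedule, and plain FedAvg are assumptions rather than ProgFed's exact protocol.

```python
import copy
import torch
import torch.nn as nn

blocks = nn.ModuleList(nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(4))
head = nn.Linear(32, 10)                            # small auxiliary classifier head

def submodel(stage):                                # the prefix trained at this stage
    return nn.Sequential(*blocks[:stage], head)

for rnd in range(8):
    stage = 1 + rnd // 2                            # activate one more block every 2 rounds
    client_states = []
    for _ in range(4):                              # simulate 4 clients per round
        local = copy.deepcopy(submodel(stage))      # only the active prefix is sent out
        opt = torch.optim.SGD(local.parameters(), lr=0.1)
        loss = nn.functional.cross_entropy(local(torch.randn(16, 32)),
                                           torch.randint(0, 10, (16,)))
        opt.zero_grad(); loss.backward(); opt.step()
        client_states.append(local.state_dict())    # only the active prefix is sent back
    avg = {k: torch.stack([s[k] for s in client_states]).mean(0)   # FedAvg on the prefix
           for k in client_states[0]}
    submodel(stage).load_state_dict(avg)            # write the averages back into blocks/head
```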
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
- Large-Scale Training System for 100-Million Classification at Alibaba [43.58719630882661]
Extreme classification has become an essential topic for deep learning.
It is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer.
First, we build a hybrid parallel training framework to make the training process feasible.
Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs.
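The sketch below conveys the KNN-softmax idea under simplifying assumptions (exact top-k search over random class weights, with the full logit matrix computed up front): the cross-entropy is evaluated only over the K classes closest to each sample plus its true class, instead of over all classes. In the actual system the candidates would come from a nearest-neighbour lookup, so the full logit matrix is never materialized.

```python
import torch
import torch.nn.functional as F

num_classes, dim, k = 100_000, 128, 256
class_weights = torch.randn(num_classes, dim)            # the huge output layer

def knn_softmax_loss(features, labels):
    logits = features @ class_weights.t()                # sketch only; the real system avoids this full matmul
    topk = logits.topk(k, dim=1).indices                 # K candidate classes per sample
    cand = torch.cat([labels.unsqueeze(1), topk], 1)     # ensure the true class is among the candidates
    cand_logits = logits.gather(1, cand)                 # only K+1 logits enter the softmax
    target = torch.zeros(len(labels), dtype=torch.long)  # the true class sits at position 0
    return F.cross_entropy(cand_logits, target)

loss = knn_softmax_loss(torch.randn(32, dim), torch.randint(0, num_classes, (32,)))
```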
arXiv Detail & Related papers (2021-02-09T06:53:31Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients.
FracTrain reduces computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
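Below is a minimal sketch of progressive precision growth under assumptions (uniform symmetric fake quantization with a straight-through estimator and a hand-picked bit schedule): early epochs run at very low precision and the bit-width of weights and activations is raised as training proceeds, which is the flavor of FracTrain's progressive fractional quantization.

```python
import torch

def fake_quant(x, bits):
    """Uniform symmetric fake quantization with a straight-through gradient."""
    if bits >= 32:
        return x
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = (x / scale).round().clamp(-qmax, qmax) * scale
    return x + (q - x).detach()               # forward uses q, backward sees identity

def bit_schedule(epoch):                       # precision grows as training proceeds
    return {0: 3, 1: 4, 2: 6}.get(epoch, 8)

w = torch.randn(64, 64, requires_grad=True)
for epoch in range(5):
    x = torch.randn(16, 64)
    bits = bit_schedule(epoch)
    loss = (fake_quant(x, bits) @ fake_quant(w, bits)).pow(2).mean()
    loss.backward()                            # gradients reach w through the STE
    with torch.no_grad():
        w -= 0.01 * w.grad
        w.grad = None
```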
arXiv Detail & Related papers (2020-12-24T05:24:10Z)