Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences
- URL: http://arxiv.org/abs/2412.07894v1
- Date: Tue, 10 Dec 2024 20:01:53 GMT
- Title: Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences
- Authors: Haoyang Li, Fangcheng Fu, Sheng Lin, Hao Ge, Xuanyu Wang, Jiawen Niu, Jie Jiang, Bin Cui
- Abstract summary: We develop Hydraulis, which jointly optimizes the parallel strategies and data assignment.
Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
- Score: 31.232756326457277
- Abstract: To optimize large Transformer model training, efficient parallel computing and advanced data management are essential. However, current methods often assume a stable and uniform training workload, neglecting imbalances in data sampling and packing that can impede performance. Specifically, data sampling imbalance arises from uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. For one thing, we introduce large model training with dynamic heterogeneous parallel strategies in response to the sequence length variations within and across training iterations. For another, we devise a two-stage data assignment approach, which strikes a good balance in terms of the training workloads both within and across model replicas. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
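To make the packing imbalance concrete, here is a minimal sketch (an illustration of the problem, not the Hydraulis implementation): because per-sequence attention time grows quadratically with length while memory grows roughly linearly, packing micro-batches to equal token counts can leave one worker far slower, whereas a greedy heuristic that balances an estimated quadratic cost spreads the work more evenly.

```python
# Illustrative sketch (not the Hydraulis implementation): pack
# variable-length sequences so that estimated attention *time*, not just
# token count, is balanced across bins. Memory grows roughly linearly
# with tokens, but attention FLOPs grow quadratically per sequence.
import heapq

def attention_cost(length: int) -> int:
    """Proxy for per-sequence attention time: quadratic in sequence length."""
    return length * length

def balanced_packing(seq_lengths, num_bins):
    """Greedy longest-processing-time packing: assign each sequence
    (longest first) to the bin with the smallest accumulated cost."""
    bins = [(0, i, []) for i in range(num_bins)]  # (cost, bin_id, members)
    heapq.heapify(bins)
    for length in sorted(seq_lengths, reverse=True):
        cost, bin_id, members = heapq.heappop(bins)
        members.append(length)
        heapq.heappush(bins, (cost + attention_cost(length), bin_id, members))
    return sorted(bins, key=lambda b: b[1])

if __name__ == "__main__":
    # Long-tail lengths: packing by token count alone would pair the
    # 4096-token sequence with many short ones, leaving that bin far slower.
    lengths = [4096, 512, 512, 256, 256, 128, 128, 64]
    for cost, bin_id, members in balanced_packing(lengths, num_bins=2):
        print(f"bin {bin_id}: lengths={members}, est. cost={cost}")
```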
Related papers
- FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism [33.23902060961886]
Existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences.
We show that the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution.
We propose a Heterogeneous-adaptive sequence parallelism method to address this problem.
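As a rough illustration of such heterogeneity-adaptive scattering (the power-of-two degrees and the tokens_per_gpu threshold are assumptions for the sketch, not FlexSP's actual policy): longer sequences are scattered over larger sequence-parallel groups so each device's shard stays bounded, while short sequences keep small groups and avoid unnecessary communication.

```python
# Hypothetical sketch of heterogeneity-adaptive sequence parallelism.
# Thresholds and group sizes here are illustrative, not from FlexSP.

def assign_sp_degree(seq_len: int, max_degree: int = 8,
                     tokens_per_gpu: int = 2048) -> int:
    """Pick the smallest power-of-two sequence-parallel degree that keeps
    each GPU's shard at or below tokens_per_gpu."""
    degree = 1
    while degree < max_degree and seq_len > degree * tokens_per_gpu:
        degree *= 2
    return degree

if __name__ == "__main__":
    for L in [1024, 4096, 16384, 65536]:
        print(f"seq_len={L:6d} -> sequence-parallel degree {assign_sp_degree(L)}")
```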
arXiv Detail & Related papers (2024-12-02T14:16:03Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
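One plausible alignment-based auxiliary task is predicting a token-level edit tag derived from aligning the original and corrected sentences; the sketch below only illustrates the construction and is not necessarily the paper's exact formulation.

```python
# Hedged sketch: derive per-token edit tags (KEEP/REPLACE/DELETE, with
# insertions marked on the preceding token) from the alignment between the
# original and corrected sentences, usable as an auxiliary prediction task.
from difflib import SequenceMatcher

def edit_tags(src_tokens, tgt_tokens):
    tags = []
    sm = SequenceMatcher(a=src_tokens, b=tgt_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            tags += ["KEEP"] * (i2 - i1)
        elif op == "replace":
            tags += ["REPLACE"] * (i2 - i1)
        elif op == "delete":
            tags += ["DELETE"] * (i2 - i1)
        elif op == "insert" and tags:
            tags[-1] += "+INSERT"  # mark insertion after the previous token
    return tags

src = "He go to school yesterday".split()
tgt = "He went to school yesterday".split()
print(list(zip(src, edit_tags(src, tgt))))  # 'go' is tagged REPLACE
```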
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
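A minimal PyTorch sketch of the idea, assuming task-wise merging of task vectors with coefficients learned by entropy minimization on unlabeled data; the model and data here are toy placeholders, not from the paper.

```python
# Sketch in the spirit of AdaMerging:
#   merged = pretrained + sum_k lambda_k * (finetuned_k - pretrained),
# with lambda_k learned by minimizing prediction entropy on unlabeled data,
# so no original training data is required.
import torch

def merge(pretrained, task_vectors, lambdas):
    """Build merged parameters as a differentiable function of lambdas."""
    return {name: pretrained[name] +
                  sum(l * tv[name] for l, tv in zip(lambdas, task_vectors))
            for name in pretrained}

torch.manual_seed(0)
pre = {"w": torch.randn(4, 8), "b": torch.zeros(4)}          # toy classifier
tvs = [{k: 0.1 * torch.randn_like(v) for k, v in pre.items()} for _ in range(2)]
lambdas = torch.full((2,), 0.3, requires_grad=True)          # task-wise coeffs
opt = torch.optim.Adam([lambdas], lr=1e-2)
x = torch.randn(64, 8)                                        # unlabeled data

for _ in range(100):
    p = merge(pre, tvs, lambdas)
    probs = torch.softmax(x @ p["w"].T + p["b"], dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    opt.zero_grad(); entropy.backward(); opt.step()

print("learned merging coefficients:", lambdas.detach())
```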
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- Alleviating the Effect of Data Imbalance on Adversarial Training [26.36714114672729]
We study adversarial training on datasets that obey the long-tailed distribution.
We propose a new adversarial training framework -- Re-balancing Adversarial Training (REAT).
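As a generic illustration of re-balancing under adversarial training (not necessarily REAT's exact components), the sketch below combines PGD attacks with inverse-class-frequency loss weights so that tail classes are not drowned out.

```python
# Hedged sketch: PGD adversarial training with class-frequency re-weighting
# on a long-tailed label distribution. Inputs are assumed to lie in [0, 1].
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=5):
    """Standard L-infinity PGD around x, without touching parameter grads."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def rebalanced_adv_loss(model, x, y, class_counts):
    # Inverse-frequency weights: rare (tail) classes get larger weight.
    w = 1.0 / class_counts.float()
    w = w / w.sum() * len(class_counts)
    return F.cross_entropy(model(pgd_attack(model, x, y)), y, weight=w)

# Example wiring with a toy model:
model = torch.nn.Linear(32, 10)
x, y = torch.rand(16, 32), torch.randint(0, 10, (16,))
counts = torch.bincount(y, minlength=10).clamp_min(1)
rebalanced_adv_loss(model, x, y, counts).backward()
```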
arXiv Detail & Related papers (2023-07-14T07:01:48Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
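A naive NumPy sketch of the WKV recurrence at the core of RWKV (RWKV-4 style), omitting the numerical-stability tricks of the real kernels: the attention-like output is a decayed running average of past values, so inference is O(1) state per channel rather than quadratic attention.

```python
# Naive sketch of RWKV's WKV recurrence. w is a per-channel decay
# (negative) and u a per-channel "bonus" applied to the current token.
import numpy as np

def wkv(k, v, w, u):
    """k, v: (T, C) keys/values; w, u: (C,). Returns (T, C) outputs
    computed as an RNN: O(T) time, constant state per channel."""
    T, C = k.shape
    a = np.zeros(C)   # running weighted sum of values
    b = np.zeros(C)   # running sum of weights
    out = np.empty((T, C))
    decay = np.exp(w)                      # w < 0, so decay is in (0, 1)
    for t in range(T):
        e_cur = np.exp(u + k[t])           # extra weight for the current token
        out[t] = (a + e_cur * v[t]) / (b + e_cur)
        a = decay * a + np.exp(k[t]) * v[t]
        b = decay * b + np.exp(k[t])
    return out

rng = np.random.default_rng(0)
print(wkv(rng.normal(size=(6, 4)), rng.normal(size=(6, 4)),
          w=-np.ones(4), u=np.zeros(4)).shape)  # (6, 4)
```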
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
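One simple instance of the compression classes such a study covers is uniform int8 quantization of activations before they cross the model-parallel boundary; this sketch is illustrative, not the paper's implementation.

```python
# Illustrative sketch: quantize activations to int8 before sending them
# across a model-parallel boundary (4x smaller payload than fp32), then
# dequantize on the receiving stage.
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().amax().clamp_min(1e-8) / 127.0
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

x = torch.randn(1024, 4096)
q, s = quantize_int8(x)
print("max abs error:", (x - dequantize_int8(q, s)).abs().max().item())
```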
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Data splitting improves statistical performance in overparametrized regimes [0.0]
Distributed learning is a common strategy to reduce the overall training time by exploiting multiple computing devices.
We show that in this regime, data splitting has a regularizing effect, hence improving statistical performance and computational complexity.
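A hedged NumPy sketch of the effect in overparametrized linear regression: split the data across workers, fit each worker's minimum-norm least-squares solution, and average; whether splitting helps depends on the setting, as the paper analyzes.

```python
# Sketch: overparametrized regression (d > n). Compare the pooled
# minimum-norm fit against the average of per-worker minimum-norm fits.
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 200, 1000, 4                     # samples, dimensions, workers
beta_true = np.zeros(d); beta_true[:20] = 1.0
X = rng.normal(size=(n, d))
y = X @ beta_true + 0.5 * rng.normal(size=n)

def min_norm_fit(Xs, ys):
    return np.linalg.pinv(Xs) @ ys          # minimum-norm interpolating fit

pooled = min_norm_fit(X, y)
split = np.mean([min_norm_fit(X[i::M], y[i::M]) for i in range(M)], axis=0)

X_test = rng.normal(size=(2000, d))
y_test = X_test @ beta_true
for name, b in [("pooled", pooled), ("split+avg", split)]:
    print(name, "test MSE:", np.mean((X_test @ b - y_test) ** 2))
```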
arXiv Detail & Related papers (2021-10-21T08:10:56Z)
- An Accurate and Efficient Large-scale Regression Method through Best Friend Clustering [10.273838113763192]
We propose a novel and simple data structure capturing the most important information among data samples.
We combine the clustering with regression techniques as a parallel library and utilize a hybrid structure of data and model parallelism to make predictions.
arXiv Detail & Related papers (2021-04-22T01:34:29Z)
- Training Transformers for Information Security Tasks: A Case Study on Malicious URL Prediction [3.660098145214466]
We implement a malicious/benign URL predictor based on a transformer architecture that is trained from scratch.
We show that in contrast to conventional natural language processing (NLP) transformers, this model requires a different training approach to work well.
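One way the setup differs from conventional NLP is tokenization: URLs are naturally modeled at the byte level rather than with word pieces. A minimal byte-level encoder (the hyperparameters are illustrative assumptions, not from the paper):

```python
# Sketch: map URLs to fixed-length integer sequences for a from-scratch
# transformer. 256 byte values plus a padding id.
import numpy as np

PAD, MAX_LEN, VOCAB = 0, 256, 257

def encode_url(url: str) -> np.ndarray:
    raw = url.encode("utf-8")[:MAX_LEN]
    ids = np.frombuffer(raw, dtype=np.uint8).astype(np.int64) + 1  # shift past PAD
    out = np.full(MAX_LEN, PAD, dtype=np.int64)
    out[:len(ids)] = ids
    return out

batch = np.stack([encode_url(u) for u in
                  ["http://example.com/login", "http://paypa1-secure.xyz/verify"]])
print(batch.shape)  # (2, 256) -> ready for an embedding layer of size VOCAB
```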
arXiv Detail & Related papers (2020-11-05T18:58:51Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
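The out-of-core side of such a strategy can be sketched with PyTorch's saved-tensor hooks: offload activations to CPU during the forward pass and copy them back on demand during backward. KARMA additionally interleaves redundant recomputation and prefetching; this shows only the basic offload mechanism.

```python
# Sketch: offload activations saved for backward to CPU, freeing GPU memory
# between the forward and backward passes. On a CPU-only machine the
# transfers are no-ops, so the example still runs.
import torch

def pack_to_cpu(t: torch.Tensor):
    return t.to("cpu", non_blocking=True)

def unpack_from_cpu(t: torch.Tensor):
    return t.to("cuda") if torch.cuda.is_available() else t

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 512))
x = torch.randn(64, 512, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).square().mean()
loss.backward()  # saved activations are fetched back from CPU here
```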
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.