Parallel Training of GRU Networks with a Multi-Grid Solver for Long
Sequences
- URL: http://arxiv.org/abs/2203.04738v1
- Date: Mon, 7 Mar 2022 11:32:44 GMT
- Title: Parallel Training of GRU Networks with a Multi-Grid Solver for Long
Sequences
- Authors: Gordon Euhyun Moon, Eric C. Cyr
- Abstract summary: We present a novel parallel training scheme (called parallel-in-time) for Gated Recurrent Unit (GRU) networks.
MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel.
Experimental results on the HMDB51 dataset, where each video is an image sequence, demonstrate that the new parallel training scheme achieves up to 6.5$\times$ speedup over a serial approach.
- Score: 1.9798034349981162
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parallelizing Gated Recurrent Unit (GRU) networks is a challenging task, as
the training procedure of GRU is inherently sequential. Prior efforts to
parallelize GRU have largely focused on conventional parallelization strategies
such as data-parallel and model-parallel training algorithms. However, when the
given sequences are very long, existing approaches are still inevitably
performance-limited in terms of training time. In this paper, we present a
novel parallel training scheme (called parallel-in-time) for GRU based on a
multigrid reduction in time (MGRIT) solver. MGRIT partitions a sequence into
multiple shorter sub-sequences and trains the sub-sequences on different
processors in parallel. The key to achieving speedup is a hierarchical
correction of the hidden state to accelerate end-to-end communication in both
the forward and backward propagation phases of gradient descent. Experimental
results on the HMDB51 dataset, where each video is an image sequence,
demonstrate that the new parallel training scheme achieves up to 6.5$\times$
speedup over a serial approach. As the efficiency of our new parallelization
strategy is tied to the sequence length, our parallel GRU algorithm
achieves significant performance improvement as the sequence length increases.
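To make the parallel-in-time idea concrete, the following is a minimal, single-process NumPy sketch of a two-level, parareal-style iteration applied to a GRU forward pass. It only illustrates the hierarchical hidden-state correction described above: the toy GRU cell, the chunk sizes, the input-subsampling coarse propagator, and the fixed iteration count are all assumptions, and the backward/training phase of the actual MGRIT solver is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 8
Wz, Wr, Wh = (0.3 * rng.standard_normal((d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (0.3 * rng.standard_normal((d_h, d_h)) for _ in range(3))
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x):
    z = sig(Wz @ x + Uz @ h)
    r = sig(Wr @ x + Ur @ h)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1.0 - z) * h + z * h_tilde

def propagate(h, seq, stride=1):
    # serial GRU sweep over a (sub-)sequence; stride > 1 gives a cheap "coarse" propagator
    for x in seq[::stride]:
        h = gru_step(h, x)
    return h

T, C = 64, 8                                   # sequence length, number of sub-sequences
xs = rng.standard_normal((T, d_in))
chunks = np.split(xs, C)                       # sub-sequences assigned to different processors

# initial cheap coarse sweep over the chunk boundaries (serial)
H = [np.zeros(d_h)]
for j in range(C):
    H.append(propagate(H[j], chunks[j], stride=4))

for _ in range(C):                             # in practice far fewer iterations are needed
    F = [propagate(H[j], chunks[j]) for j in range(C)]              # fine sweeps: mutually independent
    G_old = [propagate(H[j], chunks[j], stride=4) for j in range(C)]
    H_new = [H[0]]
    for j in range(C):                         # serial coarse sweep plus hidden-state correction
        G_new = propagate(H_new[j], chunks[j], stride=4)
        H_new.append(F[j] + (G_new - G_old[j]))
    H = H_new

assert np.allclose(H[-1], propagate(np.zeros(d_h), xs))   # matches the fully serial forward pass
```

The expensive fine sweeps over the sub-sequences are independent of one another, which is where the parallel speedup would come from; the serial coarse sweep only touches the chunk boundaries.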
Related papers
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- DeepPCR: Parallelizing Sequential Operations in Neural Networks [4.241834259165193]
We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks.
DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm.
To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
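As a toy companion to the description above: DeepPCR targets general nonlinear layer-to-layer maps (via Newton linearization), but the core Parallel Cyclic Reduction step is easiest to see on a scalar linear recurrence h[t] = a[t]*h[t-1] + b[t]. The NumPy sketch below is an assumed, simplified illustration of that step, not the paper's algorithm.

```python
import numpy as np

def pcr_linear_recurrence(a, b, h0=0.0):
    """Solve h[t] = a[t]*h[t-1] + b[t] with h[-1] = h0 by parallel cyclic
    reduction: O(log L) sweeps, each an elementwise (hence parallel) update."""
    a, b = a.astype(float).copy(), b.astype(float).copy()
    b[0] += a[0] * h0
    a[0] = 0.0                       # h[0] no longer depends on anything
    L, stride = len(a), 1
    while stride < L:
        # substitute the equation for h[t-stride] into the equation for h[t]
        a_prev = np.concatenate([np.zeros(stride), a[:-stride]])
        b_prev = np.concatenate([np.zeros(stride), b[:-stride]])
        b = b + a * b_prev
        a = a * a_prev
        stride *= 2
    return b                         # every h[t] is now fully resolved

rng = np.random.default_rng(0)
L = 10
a, b = rng.standard_normal(L), rng.standard_normal(L)

# serial reference
h, out = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    out.append(h)

assert np.allclose(pcr_linear_recurrence(a, b), np.array(out))
```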
arXiv Detail & Related papers (2023-09-28T10:15:30Z)
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with a 4x longer sequence length than the existing state-of-the-art baseline.
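The layout switch behind that all-to-all can be sketched without any distributed machinery. Below is a hedged, single-process NumPy illustration in which reshapes and transposes stand in for the collective: token-sharded activations (each "worker" holds N/P tokens of all H heads) are exchanged for head-sharded ones (all N tokens, H/P heads), and back. The shapes and names are assumptions, not DeepSpeed's API.

```python
import numpy as np

P, N, H, d = 4, 16, 8, 5                         # workers, tokens, heads, head dim
rng = np.random.default_rng(0)
x = rng.standard_normal((P, N // P, H, d))       # [worker, local tokens, heads, dim]

# "all-to-all": exchange token shards for head shards
x_heads = (x.reshape(P, N // P, P, H // P, d)    # split the head axis across workers
             .transpose(2, 0, 1, 3, 4)           # new worker axis = head-shard index
             .reshape(P, N, H // P, d))          # each worker: all N tokens, H/P heads

# ... each worker would now run full-sequence attention for its own heads ...

# inverse "all-to-all": restore the token-sharded layout
x_back = (x_heads.reshape(P, P, N // P, H // P, d)
                 .transpose(1, 2, 0, 3, 4)
                 .reshape(P, N // P, H, d))
assert np.allclose(x_back, x)
```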
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
- Parallelizing non-linear sequential models over the sequence length [7.99707131886133]
We develop a parallel algorithm that accelerates GPU evaluation of sequential models by up to 3 orders of magnitude.
We discover the efficacy of the Gated Recurrent Unit in a long time series classification problem with 17k time samples.
arXiv Detail & Related papers (2023-09-21T16:52:34Z)
- Parallel Algorithms Align with Neural Execution [7.535219325248997]
Parallel algorithms, however, may exploit their full computational power, and therefore require fewer layers to be executed.
This drastically reduces training times, as we observe when comparing parallel implementations of searching, sorting and finding strongly connected components to their sequential counterparts on the CLRS framework.
arXiv Detail & Related papers (2023-07-08T21:28:20Z)
- GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism [6.3568605707961]
Mini-batch training is commonly used to train graph neural networks (GNNs) on large graphs.
In this paper, we introduce a hybrid parallel mini-batch training paradigm called split parallelism.
We show that split parallelism outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P3$.
arXiv Detail & Related papers (2023-03-24T03:28:05Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is $straggling$ workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
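For intuition on the coded-redundancy idea (the static building block; the paper's contribution is the dynamic, clustered code selection on top of it), here is the classic three-worker example in NumPy: each worker holds two of three data partitions and sends one coded gradient, and the master recovers the full gradient from any two workers, so one straggler can be ignored. The coefficients below are the textbook cyclic-repetition code, assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
g1, g2, g3 = rng.standard_normal((3, 5))       # per-partition gradients
full = g1 + g2 + g3                            # what the master wants

msg = {1: 0.5 * g1 + g2,                       # worker 1 holds {D1, D2}
       2: g2 - g3,                             # worker 2 holds {D2, D3}
       3: 0.5 * g1 + g3}                       # worker 3 holds {D1, D3}

# any two workers suffice; the straggler's message is never needed
decode = {frozenset({1, 2}): 2 * msg[1] - msg[2],
          frozenset({1, 3}): msg[1] + msg[3],
          frozenset({2, 3}): msg[2] + 2 * msg[3]}

for pair, g_hat in decode.items():
    assert np.allclose(g_hat, full)
```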
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
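A minimal PyTorch-style sketch of truncated layer-wise backpropagation, under assumed architecture and hyperparameters: each block gets its own auxiliary head and loss, and block inputs are detached so no gradient crosses block boundaries, letting blocks update without waiting for a global backward pass. This illustrates local parallelism in spirit, not the paper's exact training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# three blocks, each with an assumed auxiliary classifier head and its own optimizer
blocks = nn.ModuleList(nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3))
heads  = nn.ModuleList(nn.Linear(32, 10) for _ in range(3))
opts   = [torch.optim.SGD([*b.parameters(), *h.parameters()], lr=1e-2)
          for b, h in zip(blocks, heads)]

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
h = x
for block, head, opt in zip(blocks, heads, opts):
    h = block(h.detach())                   # truncate the graph at the block boundary
    loss = F.cross_entropy(head(h), y)      # purely local objective
    opt.zero_grad(); loss.backward(); opt.step()   # update this block only
```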
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
- Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers [112.23981192818721]
We propose to use backward mode linear relaxation based analysis (LiRPA) to replace Linear Programming (LP) during the BaB process.
Unlike LP, LiRPA when applied naively can produce much weaker bounds and even cannot check certain conflicts of sub-domains during splitting.
We demonstrate an order of magnitude speedup compared to existing LP-based approaches.
arXiv Detail & Related papers (2020-11-27T16:42:12Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
- Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
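The fixed-point view is easy to reproduce on a toy network. The NumPy sketch below (an assumed MLP, not the paper's implementation) treats all layer activations as one set of unknowns, runs Jacobi sweeps that update every layer simultaneously, and recovers exactly the serial forward pass in at most L sweeps; the paper's point is that convergence often needs far fewer parallel iterations than L.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 12, 4
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
f = lambda l, h: np.tanh(Ws[l] @ h)              # layer l of a toy MLP

x = rng.standard_normal(d)

# sequential reference
h_seq = x
for l in range(L):
    h_seq = f(l, h_seq)

# Jacobi fixed-point iteration: update every layer's activation simultaneously
h = [np.zeros(d) for _ in range(L)]              # initial guess for all activations
for _ in range(L):                               # at most L sweeps needed for exactness
    prev = [x] + h[:-1]
    h = [f(l, prev[l]) for l in range(L)]

assert np.allclose(h[-1], h_seq)
```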
arXiv Detail & Related papers (2020-02-10T10:11:31Z)