LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks
- URL: http://arxiv.org/abs/2512.08160v1
- Date: Tue, 09 Dec 2025 01:35:08 GMT
- Title: LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks
- Authors: Nanda K. Unnikrishnan, Keshab K. Parhi,
- Abstract summary: A principled understanding of how much gradient delay must be introduced at each layer to achieve a desired level of pipelining was not addressed. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages.
- Score: 6.69087470775851
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In our prior work, LayerPipe, we introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite its empirical success, a principled understanding of how much gradient delay must be introduced at each layer to achieve a desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure: inner layers require fewer delays and outer layers require longer delays. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages. When layers are pipelined in groups, all layers in a group share the same delay assignment. These insights not only explain previously observed scheduling patterns but also expose an often overlooked challenge: pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication-computation tradeoffs.
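The abstract does not give the paper's exact update rules, but the two ideas it describes can be sketched: each layer's gradient is delayed by its number of remaining downstream stages, and an exponential moving average (EMA) of the weights stands in for explicitly stored historical weight copies. The toy loss, learning rate, and decay constant below are illustrative assumptions, not values from the paper:

```python
# Hedged sketch (not the paper's exact algorithm): layer-wise delayed-gradient
# updates where each layer's delay equals its number of downstream stages,
# plus an EMA of weights as a cheap stand-in for historical weight storage.
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
n_layers, lr, beta, steps = 4, 0.1, 0.9, 20

weights = [rng.standard_normal(3) for _ in range(n_layers)]
init = [w.copy() for w in weights]           # saved only for comparison
ema = [w.copy() for w in weights]            # running average of weights
queues = [deque() for _ in range(n_layers)]  # buffered (stale) gradients

def toy_grad(w):
    # placeholder loss 0.5 * ||w||^2, so the gradient is w itself
    return w

for _ in range(steps):
    for i in range(n_layers):
        delay = n_layers - 1 - i             # outer layers wait longer
        queues[i].append(toy_grad(weights[i]))
        if len(queues[i]) > delay:
            g = queues[i].popleft()          # gradient from `delay` steps ago
            weights[i] = weights[i] - lr * g
        # The EMA tracks past weights, so a stale gradient can be paired with
        # an approximate historical state instead of a stored copy.
        ema[i] = beta * ema[i] + (1 - beta) * weights[i]
```

Note that the innermost layer (delay 0) updates immediately, while the outermost layer always keeps three gradients in flight, matching the abstract's claim that the delay depends only on the number of remaining downstream stages.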
Related papers
- Accelerated Predictive Coding Networks via Direct Kolen-Pollack Feedback Alignment [7.328567184271344]
Predictive coding (PC) is a biologically inspired algorithm for training neural networks that relies only on local updates. We propose direct Kolen-Pollack predictive coding (DKP-PC), which simultaneously addresses both feedback delay and exponential decay, yielding a more efficient and scalable variant of PC.
arXiv Detail & Related papers (2026-02-17T13:29:14Z) - Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting [70.75913449565203]
A Transformer-based encoder has been widely used with block processing. We propose a new encoder, Spiralformer, tailored for block processing by combining layer dropping and early exiting. Experimentally, we observed that our method achieved a 21.6% reduction in the averaged token emission delay on Librispeech.
arXiv Detail & Related papers (2025-10-01T14:56:45Z) - StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs [8.960494482210919]
We propose a memory-efficient backpropagation (BP) method called StreamBP. StreamBP performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times.
arXiv Detail & Related papers (2025-06-03T16:54:15Z) - Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in pipeline parallelism. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of a fixed gradient delay.
arXiv Detail & Related papers (2025-05-02T08:23:29Z) - Efficient Event-based Delay Learning in Spiking Neural Networks [0.1350479308585481]
Spiking Neural Networks (SNNs) compute using sparse communication and are attracting increased attention. We propose a novel event-based training method for SNNs with delays, grounded in the EventProp formalism. Our method supports multiple spikes per neuron and, to the best of our knowledge, is the first delay learning algorithm to be applied to recurrent SNNs.
arXiv Detail & Related papers (2025-01-13T13:44:34Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding [13.747101397628887]
We present an end-to-end solution to speed-up inference of large language models (LLMs)
We apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit.
We show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model.
arXiv Detail & Related papers (2024-04-25T16:20:23Z) - Robust Stochastically-Descending Unrolled Networks [85.6993263983062]
Deep unrolling is an emerging learning-to-optimize method that unrolls a truncated iterative algorithm in the layers of a trainable neural network. We show that convergence guarantees and generalizability of the unrolled networks are still open theoretical problems. We numerically assess unrolled architectures trained under the proposed constraints in two different applications.
arXiv Detail & Related papers (2023-12-25T18:51:23Z) - Boosting Pruned Networks with Linear Over-parameterization [8.796518772724955]
Structured pruning compresses neural networks by reducing channels (filters) for fast inference and low footprint at run-time.
To restore accuracy after pruning, fine-tuning is usually applied to pruned networks.
We propose a novel method that first linearly over-parameterizes the compact layers in pruned networks to enlarge the number of fine-tuning parameters.
arXiv Detail & Related papers (2022-04-25T05:30:26Z) - LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and
Inter-Layer Gradient Pipelining and Multiprocessor Scheduling [6.549125450209931]
Training model parameters by backpropagation inherently creates feedback loops.
The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training.
arXiv Detail & Related papers (2021-08-14T23:51:00Z) - Rethinking Skip Connection with Layer Normalization in Transformers and
ResNets [49.87919454950763]
Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how scale factors affect the effectiveness of the skip connection.
arXiv Detail & Related papers (2021-05-15T11:44:49Z) - Training cascaded networks for speeded decisions using a
temporal-difference loss [39.79639377894641]
Deep feedforward neural networks operate in sequential stages.
In our work, we construct a cascaded ResNet by introducing a propagation delay into each residual block.
Because information transmitted through skip connections avoids delays, the functional depth of the architecture increases over time.
arXiv Detail & Related papers (2021-02-19T08:40:19Z) - Fast and Complete: Enabling Complete Neural Network Verification with
Rapid and Massively Parallel Incomplete Verifiers [112.23981192818721]
We propose to use backward mode linear relaxation based analysis (LiRPA) to replace Linear Programming (LP) during the BaB process.
Unlike LP, LiRPA when applied naively can produce much weaker bounds and even cannot check certain conflicts of sub-domains during splitting.
We demonstrate an order of magnitude speedup compared to existing LP-based approaches.
arXiv Detail & Related papers (2020-11-27T16:42:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.