BP(\lambda): Online Learning via Synthetic Gradients
- URL: http://arxiv.org/abs/2401.07044v1
- Date: Sat, 13 Jan 2024 11:13:06 GMT
- Title: BP(\lambda): Online Learning via Synthetic Gradients
- Authors: Joseph Pemberton and Rui Ponte Costa
- Abstract summary: Training recurrent neural networks typically relies on backpropagation through time (BPTT).
In the original implementation by Jaderberg et al., synthetic gradients are learned through a mixture of backpropagated gradients and bootstrapped synthetic gradients.
Inspired by the accumulate $\mathrm{TD}(\lambda)$ algorithm in RL, we propose a fully online method for learning synthetic gradients which avoids the use of BPTT altogether.
- Score: 6.581214715240991
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training recurrent neural networks typically relies on backpropagation
through time (BPTT). BPTT depends on forward and backward passes to be
completed, rendering the network locked to these computations before loss
gradients are available. Recently, Jaderberg et al. proposed synthetic
gradients to alleviate the need for full BPTT. In their implementation
synthetic gradients are learned through a mixture of backpropagated gradients
and bootstrapped synthetic gradients, analogous to the temporal difference (TD)
algorithm in Reinforcement Learning (RL). However, as in TD learning, heavy use
of bootstrapping can result in bias which leads to poor synthetic gradient
estimates. Inspired by the accumulate $\mathrm{TD}(\lambda)$ in RL, we propose
a fully online method for learning synthetic gradients which avoids the use of
BPTT altogether: accumulate $\mathrm{BP}(\lambda)$. As in accumulate
$\mathrm{TD}(\lambda)$, we show analytically that accumulate
$\mathrm{BP}(\lambda)$ can control the level of bias by using a mixture of
temporal difference errors and recursively defined eligibility traces. We next
demonstrate empirically that our model outperforms the original implementation
for learning synthetic gradients in a variety of tasks, and is particularly
suited for capturing longer timescales. Finally, building on recent work we
reflect on accumulate $\mathrm{BP}(\lambda)$ as a principle for learning in
biological circuits. In summary, inspired by RL principles we introduce an
algorithm capable of bias-free online learning via synthetic gradients.
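To make the analogy concrete, the following is a minimal sketch of an accumulate-$\mathrm{BP}(\lambda)$-style update for a toy RNN with a scalar hidden state and a linear synthetic-gradient module. The task, hyperparameters, and the exact placement of the recurrent Jacobian in the trace are illustrative simplifications, not the authors' implementation.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy RNN: h_{t+1} = tanh(w*h_t + u*x_t); per-step prediction y_hat_t = v*h_t.
w, u, v = 0.5, 1.0, 1.0
a, b = 0.0, 0.0            # linear synthetic-gradient module: g_hat(h) = a*h + b
lam, alpha = 0.9, 0.01     # trace decay (lambda) and module learning rate

def g_hat(h):
    # Synthetic gradient: estimate of the gradient of all future loss w.r.t. h.
    return a * h + b

T = 200
x = rng.normal(size=T)
y = np.roll(x, 1)          # target = previous input, so credit must span time steps

h, jac_prev = 0.0, 0.0
e = np.zeros(2)            # accumulating eligibility trace over (a, b)
for t in range(T - 1):
    # Local loss gradient at time t, for L_t = 0.5*(v*h_t - y_t)^2.
    dL_dh = (v * h - y[t]) * v

    # One forward step and the recurrent Jacobian dh_{t+1}/dh_t (a scalar here).
    h_next = np.tanh(w * h + u * x[t])
    jac = (1.0 - h_next ** 2) * w

    # Trace: decay through lambda and the previous step's Jacobian, then
    # accumulate the gradient of g_hat(h_t) w.r.t. its parameters (a, b).
    e = lam * jac_prev * e + np.array([h, 1.0])

    # TD-style error: (local gradient + bootstrapped future estimate) - prediction.
    delta = dL_dh + jac * g_hat(h_next) - g_hat(h)

    # Online update of the module, mirroring the accumulate TD(lambda) update.
    step = alpha * delta * e
    a, b = a + step[0], b + step[1]

    # (The RNN weights w, u, v would themselves be trained online by using
    #  g_hat in place of the unavailable future gradient; omitted here.)
    h, jac_prev = h_next, jac
```
Here the synthetic-gradient module plays the role that the value function plays in $\mathrm{TD}(\lambda)$: the temporal difference error bootstraps on the module's own next-step prediction, and the eligibility trace propagates that error to earlier steps without unrolling the network.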
Related papers
- Imitation Learning in Discounted Linear MDPs without exploration assumptions [58.81226849657474]
We present a new algorithm for imitation learning in infinite horizon linear MDPs dubbed ILARL.
We improve the dependence on the desired accuracy $\epsilon$ from $\mathcal{O}(\epsilon^{-5})$ to $\mathcal{O}(\epsilon^{-4})$.
Numerical experiments with linear function approximation show that ILARL outperforms other commonly used algorithms.
arXiv Detail & Related papers (2024-05-03T15:28:44Z)
- Variance Reduced Online Gradient Descent for Kernelized Pairwise Learning with Limited Memory [19.822215548822882]
Online gradient descent (OGD) algorithms have been proposed to handle online pairwise learning, where data arrives sequentially.
Recent advancements in OGD algorithms have aimed to reduce the complexity of calculating online gradients, achieving complexities less than $O(T)$ and even as low as $O(1)$.
In this study, we propose a limited memory OGD algorithm that extends to kernel online pairwise learning while improving the sublinear regret.
arXiv Detail & Related papers (2023-10-10T09:50:54Z)
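As a rough illustration of the limited-memory idea in the entry above (not the paper's kernelized or variance-reduced algorithm), the sketch below pairs each incoming example only with a small reservoir buffer, so the per-step gradient cost is $O(\text{buffer size})$ rather than $O(t)$. The loss, labels, and buffer policy are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

def pair_grad(w, xi, yi, xj, yj):
    # Gradient of a hinge loss on one labelled pair for a linear scorer.
    if yi == yj:
        return np.zeros_like(w)
    margin = (yi - yj) * (w @ (xi - xj))
    if margin >= 1.0:
        return np.zeros_like(w)
    return -(yi - yj) * (xi - xj)

d, buffer_size, lr = 5, 8, 0.05
w = np.zeros(d)
buffer = []                                       # small memory of past examples

for t in range(500):
    x_t = rng.normal(size=d)
    y_t = 1.0 if x_t[0] > 0 else -1.0             # toy labels

    # Pair the new example only with the buffer, not with all past data,
    # so each step costs O(buffer_size) instead of O(t).
    grad = np.zeros(d)
    for x_b, y_b in buffer:
        grad += pair_grad(w, x_t, y_t, x_b, y_b)
    if buffer:
        w -= lr * grad / len(buffer)

    # Reservoir sampling keeps the buffer a uniform sample of the stream.
    if len(buffer) < buffer_size:
        buffer.append((x_t, y_t))
    elif rng.random() < buffer_size / (t + 1.0):
        buffer[rng.integers(buffer_size)] = (x_t, y_t)
```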
- Scalable Real-Time Recurrent Learning Using Columnar-Constructive Networks [19.248060562241296]
We propose two constraints that make real-time recurrent learning scalable.
We show that by either decomposing the network into independent modules or learning the network in stages, we can make RTRL scale linearly with the number of parameters.
We demonstrate the effectiveness of our approach over Truncated-BPTT on a prediction benchmark inspired by animal learning and by doing policy evaluation of pre-trained policies for Atari 2600 games.
arXiv Detail & Related papers (2023-01-20T23:17:48Z)
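A toy sketch of why restricting the recurrence to independent modules makes real-time recurrent learning (RTRL) cheap: each module keeps its own small sensitivity tensor instead of one giant tensor for the whole network. The architecture, task, and sizes below are illustrative assumptions, not the constructive procedure from the paper above.
```python
import numpy as np

rng = np.random.default_rng(0)

# M independent tanh modules of hidden size m; the recurrence never crosses
# modules, so each module only needs an (m, m, m) sensitivity tensor and the
# total cost grows with the number of modules rather than cubically in M*m.
M, m, n_in, T, lr = 4, 5, 3, 100, 0.01
Ws = [0.1 * rng.normal(size=(m, m)) for _ in range(M)]
Us = [0.1 * rng.normal(size=(m, n_in)) for _ in range(M)]
r = 0.1 * rng.normal(size=M * m)                 # linear readout over all modules

hs = [np.zeros(m) for _ in range(M)]
Ss = [np.zeros((m, m, m)) for _ in range(M)]     # S[k, a, b] = d h[k] / d W[a, b]

for t in range(T):
    x = rng.normal(size=n_in)
    target = x.sum()                             # toy regression target

    for i in range(M):
        pre = Ws[i] @ hs[i] + Us[i] @ x
        h_new = np.tanh(pre)
        d = 1.0 - h_new ** 2                     # tanh'(pre)

        # RTRL sensitivity update, restricted to this module's own weights.
        immediate = np.zeros((m, m, m))
        immediate[np.arange(m), np.arange(m), :] = hs[i]   # d pre[k] / d W[k, b]
        carried = np.einsum('kj,jab->kab', Ws[i], Ss[i])
        Ss[i] = d[:, None, None] * (immediate + carried)
        hs[i] = h_new

    h_full = np.concatenate(hs)
    err = r @ h_full - target                    # online squared-error loss

    # Weight updates from the local sensitivities; no backward unroll in time.
    for i in range(M):
        dL_dh = err * r[i * m:(i + 1) * m]
        Ws[i] -= lr * np.einsum('k,kab->ab', dL_dh, Ss[i])
    r -= lr * err * h_full
```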
- Online Training Through Time for Spiking Neural Networks [66.7744060103562]
Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models.
Recent progress in training methods has enabled successful deep SNNs on large-scale tasks with low latency.
We propose online training through time (OTTT) for SNNs, which is derived from BPTT to enable forward-in-time learning.
arXiv Detail & Related papers (2022-10-09T07:47:56Z)
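In the same forward-in-time spirit as the entry above, here is a toy single-neuron sketch that replaces backpropagation through time with a running input trace and an instantaneous error. The surrogate derivative, loss, and hyperparameters are illustrative assumptions rather than the OTTT derivation.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy leaky integrate-and-fire (LIF) neuron trained forward in time.
n_in, T = 10, 300
w = rng.normal(scale=0.1, size=n_in)
decay, v_th, lr, trace_decay = 0.9, 1.0, 0.05, 0.9

def surrogate(v):
    # Smooth stand-in for the derivative of the non-differentiable spike function.
    return 1.0 / (1.0 + (np.pi * (v - v_th)) ** 2)

v = 0.0
trace = np.zeros(n_in)                 # running trace of presynaptic activity
for t in range(T):
    x = (rng.random(n_in) < 0.3).astype(float)        # input spikes
    target = 1.0 if x[:5].sum() > x[5:].sum() else 0.0

    v = decay * v + w @ x                              # membrane update
    sg = surrogate(v)                                  # surrogate gradient at v
    spike = float(v >= v_th)
    v -= spike * v_th                                  # soft reset

    trace = trace_decay * trace + x                    # accumulate input history online

    # The instantaneous error drives the weight update through the trace,
    # so no backward pass over earlier time steps is required.
    err = spike - target
    w -= lr * err * sg * trace
```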
- Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new and alternative view of neural networks as a set of binary search trees, where each training iteration corresponds to modifying a small subset of the nodes.
We believe this view will have further applications in the design and analysis of deep neural networks.
arXiv Detail & Related papers (2022-08-09T02:29:42Z)
- A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks [11.461878019780597]
Gradient Descent might converge slowly in some deep neural networks.
It remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup.
arXiv Detail & Related papers (2022-05-10T16:55:33Z)
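To illustrate the general setting of the entry above, the sketch below simulates several workers that clip their local stochastic gradients before a server averages them, on a toy least-squares problem. The clip-then-average order and all constants are illustrative assumptions, not necessarily the paper's communication-efficient scheme.
```python
import numpy as np

rng = np.random.default_rng(0)

def clip(g, max_norm):
    # Rescale g so that its L2 norm is at most max_norm.
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

n_workers, d, lr, max_norm = 4, 10, 0.1, 1.0
w_true = rng.normal(size=d)
w = np.zeros(d)

for step in range(200):
    local_grads = []
    for _ in range(n_workers):
        X = rng.normal(size=(32, d))                # worker's local mini-batch
        y = X @ w_true + 0.1 * rng.normal(size=32)
        g = X.T @ (X @ w - y) / len(y)              # local least-squares gradient
        local_grads.append(clip(g, max_norm))       # clip locally before syncing
    w -= lr * np.mean(local_grads, axis=0)          # server averages clipped gradients
```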
- Can we learn gradients by Hamiltonian Neural Networks? [68.8204255655161]
We propose a meta-learner based on ODE neural networks that learns gradients.
We demonstrate that our method outperforms a meta-learner based on LSTM for an artificial task and the MNIST dataset with ReLU activations in the optimizee.
arXiv Detail & Related papers (2021-10-31T18:35:10Z)
- Scalable Online Recurrent Learning Using Columnar Neural Networks [35.584855852204385]
An algorithm called RTRL can compute gradients for recurrent networks online but is computationally intractable for large networks.
We propose a credit-assignment algorithm that approximates the gradients for recurrent learning in real-time using $O(n)$ operations and memory per-step.
arXiv Detail & Related papers (2021-03-09T23:45:13Z)
- Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
- Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing its Gradient Estimator Bias [65.13042449121411]
In practice, training a network with the gradient estimates provided by EP does not scale to visual tasks harder than MNIST.
We show that a bias in the gradient estimate of EP, inherent in the use of finite nudging, is responsible for this phenomenon.
We apply these techniques to train an architecture with asymmetric forward and backward connections, yielding a 13.2% test error.
arXiv Detail & Related papers (2020-06-06T09:36:07Z)
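The bias mechanism described above has a simple finite-difference analogy: a one-sided estimate of a derivative at zero nudging strength carries an error of order $\beta$, while a symmetric two-sided estimate reduces it to order $\beta^2$. The toy below demonstrates this on an arbitrary smooth function standing in for the nudged quantity; it is an analogy for the bias, not an equilibrium propagation implementation.
```python
import numpy as np

# F stands in for the nudged quantity whose derivative at beta = 0 is the
# gradient of interest; sin is arbitrary, only its smoothness matters.
def F(beta):
    return np.sin(1.3 + beta)

true_grad = np.cos(1.3)
for beta in (0.5, 0.1, 0.02):
    one_sided = (F(beta) - F(0.0)) / beta            # bias of order beta
    symmetric = (F(beta) - F(-beta)) / (2.0 * beta)  # bias of order beta**2
    print(f"beta={beta:5.2f}  one-sided error={abs(one_sided - true_grad):.6f}  "
          f"symmetric error={abs(symmetric - true_grad):.6f}")
```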
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the OptimisticOA algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over their non-adaptive counterparts in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
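For context on the "optimistic" family of min-max updates discussed above, the toy below compares plain simultaneous gradient descent-ascent with its optimistic variant on the bilinear game $f(x, y) = xy$, where the former spirals away from the saddle point and the latter converges. This illustrates only the generic optimistic update, not the adaptive variant analyzed in that paper.
```python
import numpy as np

lr, steps = 0.1, 200

def grads(x, y):
    # For f(x, y) = x * y: df/dx = y and df/dy = x.
    return y, x

# Plain simultaneous gradient descent-ascent: spirals away from (0, 0).
x, y = 1.0, 1.0
for _ in range(steps):
    gx, gy = grads(x, y)
    x, y = x - lr * gx, y + lr * gy
print("GDA distance from the saddle point:", np.hypot(x, y))

# Optimistic variant: extrapolate with the difference of consecutive gradients.
x, y = 1.0, 1.0
gx_prev, gy_prev = 0.0, 0.0
for _ in range(steps):
    gx, gy = grads(x, y)
    x -= lr * (2.0 * gx - gx_prev)
    y += lr * (2.0 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy
print("Optimistic GDA distance from the saddle point:", np.hypot(x, y))
```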
This list is automatically generated from the titles and abstracts of the papers on this site.