Scalable Online Recurrent Learning Using Columnar Neural Networks
- URL: http://arxiv.org/abs/2103.05787v1
- Date: Tue, 9 Mar 2021 23:45:13 GMT
- Title: Scalable Online Recurrent Learning Using Columnar Neural Networks
- Authors: Khurram Javed, Martha White, Rich Sutton
- Abstract summary: An algorithm called RTRL can compute gradients for recurrent networks online but is computationally intractable for large networks.
We propose a credit-assignment algorithm that approximates the gradients for recurrent learning in real time using $O(n)$ operations and memory per step.
- Score: 35.584855852204385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structural credit assignment for recurrent learning is challenging. An
algorithm called RTRL can compute gradients for recurrent networks online but
is computationally intractable for large networks. Alternatives, such as BPTT,
are not online. In this work, we propose a credit-assignment algorithm that
approximates the gradients for recurrent learning in real time using $O(n)$
operations and memory per step. Our method builds on the
idea that for modular recurrent networks, composed of columns with scalar
states, it is sufficient for a parameter to only track its influence on the
state of its column. We empirically show that as long as connections between
columns are sparse, our method approximates the true gradient well. In the
special case when there are no connections between columns, the $O(n)$ gradient
estimate is exact. We demonstrate the utility of the approach for both
recurrent state learning and meta-learning by comparing the estimated gradient
to the true gradient on a synthetic test-bed.
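To make the columnar idea concrete, the sketch below is a hypothetical Python reconstruction of the exact special case mentioned above: a recurrent network of independent columns, each with a scalar state, where every parameter keeps an RTRL-style trace of its influence on its own column's state only, giving $O(n)$ time and memory per step. The class name, parameter names, and tanh dynamics are assumptions for illustration, not the authors' released code; with sparse cross-column connections, the same traces would yield the paper's approximate gradient rather than the exact one.

```python
import numpy as np

class ColumnarRTRL:
    """Illustrative columnar recurrent net: one scalar state per column.

    Each parameter keeps a trace of its influence on its own column's
    state only, so the per-step cost is linear in the number of parameters.
    With no connections between columns (as assumed here), the gradient is exact.
    """

    def __init__(self, n_columns, n_inputs, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.w_rec = rng.normal(0.0, 0.1, n_columns)             # scalar recurrent weight per column
        self.W_in = rng.normal(0.0, 0.1, (n_columns, n_inputs))  # input weights per column
        self.h = np.zeros(n_columns)                             # scalar state per column
        self.tr_rec = np.zeros(n_columns)                        # trace: d h[i] / d w_rec[i]
        self.tr_in = np.zeros((n_columns, n_inputs))             # trace: d h[i] / d W_in[i, :]
        self.lr = lr

    def step(self, x):
        pre = self.w_rec * self.h + self.W_in @ x
        h_new = np.tanh(pre)
        d = 1.0 - h_new ** 2                                     # derivative of tanh at `pre`
        # RTRL recursion restricted to each column's own state (O(n) overall):
        # d h_t / d w = tanh'(pre) * (h_{t-1} + w * d h_{t-1} / d w)
        self.tr_rec = d * (self.h + self.w_rec * self.tr_rec)
        self.tr_in = d[:, None] * (x[None, :] + self.w_rec[:, None] * self.tr_in)
        self.h = h_new
        return h_new

    def update(self, dL_dh):
        """Apply an online update given dL/dh for the current state."""
        self.w_rec -= self.lr * dL_dh * self.tr_rec
        self.W_in -= self.lr * dL_dh[:, None] * self.tr_in
```

A single online step under these assumptions would look like `net = ColumnarRTRL(n_columns=8, n_inputs=3)`, then `h = net.step(x)` followed by `net.update(dL_dh)`, where `dL_dh` is the gradient of whatever per-step loss is being minimized.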
Related papers
- BP($\lambda$): Online Learning via Synthetic Gradients [6.581214715240991]
Training recurrent neural networks typically relies on backpropagation through time (BPTT).
In the original implementation of synthetic gradients, the synthetic gradients are learned through a mixture of backpropagated gradients and bootstrapped synthetic gradients.
Inspired by the accumulate $\mathrm{TD}(\lambda)$ algorithm in RL, we propose a fully online method for learning synthetic gradients which avoids the use of BPTT altogether.
arXiv Detail & Related papers (2024-01-13T11:13:06Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Fighting Uncertainty with Gradients: Offline Reinforcement Learning via Diffusion Score Matching [22.461036967440723]
We study smoothed distance to data as an uncertainty metric, and claim that it has two beneficial properties.
We show these gradients can be efficiently learned with score-matching techniques.
We propose Score-Guided Planning (SGP) to enable first-order planning in high-dimensional problems.
arXiv Detail & Related papers (2023-06-24T23:40:58Z)
- Scalable Real-Time Recurrent Learning Using Columnar-Constructive Networks [19.248060562241296]
We propose two constraints that make real-time recurrent learning scalable.
We show that by either decomposing the network into independent modules or learning the network in stages, we can make RTRL scale linearly with the number of parameters.
We demonstrate the effectiveness of our approach over Truncated-BPTT on a prediction benchmark inspired by animal learning and by doing policy evaluation of pre-trained policies for Atari 2600 games.
arXiv Detail & Related papers (2023-01-20T23:17:48Z)
- Stochastic Gradient Descent with Dependent Data for Offline Reinforcement Learning [4.421561004829125]
Offline learning is useful for dealing with the exploration-exploitation trade-off and enables data reuse in many applications.
In this work, we study two offline learning tasks: policy evaluation and policy learning.
arXiv Detail & Related papers (2022-02-06T20:54:36Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach for handling streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization [16.02377434191239]
We propose accumulated decoupled learning (ADL), which incorporates the gradient accumulation technique to mitigate the stale gradient effect.
We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature.
ADL is shown to outperform several state-of-the-art methods on classification tasks and is the fastest among the compared methods.
arXiv Detail & Related papers (2020-12-03T11:52:55Z)
- LoCo: Local Contrastive Representation Learning [93.98029899866866]
We show that by overlapping local blocks stacked on top of each other, we effectively increase the decoder depth and allow upper blocks to implicitly send feedback to lower blocks.
This simple design closes the performance gap between local learning and end-to-end contrastive learning algorithms for the first time.
arXiv Detail & Related papers (2020-08-04T05:41:29Z)
- Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.