A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity
- URL: http://arxiv.org/abs/2501.11407v1
- Date: Mon, 20 Jan 2025 11:14:11 GMT
- Title: A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity
- Authors: Jamie Lohoff, Anil Kaya, Florian Assmuth, Emre Neftci
- Abstract summary: We present a custom automatic differentiation (AD) pipeline for sparse and online implementation of gradient-based synaptic plasticity rules. Our work combines the programming ease of backpropagation-type methods with forward AD while being memory-efficient. We demonstrate how memory utilization scales with network size without dependence on the sequence length.
- Score: 0.7617849765320394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online synaptic plasticity rules derived from gradient descent achieve high accuracy on a wide range of practical tasks. However, their software implementation often requires tediously hand-derived gradients or using gradient backpropagation, which sacrifices the online capability of the rules. In this work, we present a custom automatic differentiation (AD) pipeline for sparse and online implementation of gradient-based synaptic plasticity rules that generalizes to arbitrary neuron models. Our work combines the programming ease of backpropagation-type methods with forward AD while being memory-efficient. To achieve this, we exploit the advantageous compute and memory scaling of online synaptic plasticity by providing an inherently sparse implementation of AD where expensive tensor contractions are replaced with simple element-wise multiplications if the tensors are diagonal. Gradient-based synaptic plasticity rules such as eligibility propagation (e-prop) have exactly this property and thus profit immensely from this feature. We demonstrate the alignment of our gradients with respect to gradient backpropagation on a synthetic task where e-prop gradients are exact, as well as on audio speech classification benchmarks. We demonstrate how memory utilization scales with network size without dependence on the sequence length, as expected from forward AD methods.
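The central idea of the abstract, replacing the full forward-AD tensor contraction with element-wise updates whenever the hidden-state Jacobian is diagonal, can be illustrated with a minimal sketch. This is not the authors' pipeline: the leaky-neuron model, the toy squared-error readout, and all variable names below are assumptions made purely for illustration.

```python
# Minimal sketch (assumed model, not the paper's implementation):
# a leaky neuron v_t = alpha * v_{t-1} + W x_t has a diagonal state
# Jacobian dv_t/dv_{t-1} = diag(alpha), so the forward-AD (RTRL)
# tensor contraction collapses to an element-wise e-prop-style trace.
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 4, 3, 50            # neurons, inputs, sequence length (toy sizes)
alpha = 0.9                   # leak factor, assumed scalar for simplicity
W = 0.1 * rng.normal(size=(N, M))
target = rng.normal(size=N)   # fixed target for a toy squared-error readout

v = np.zeros(N)               # neuron state
elig = np.zeros((N, M))       # eligibility trace: the nonzero (diagonal) slice of dv_t/dW
grad = np.zeros((N, M))       # online-accumulated gradient of the loss w.r.t. W

for t in range(T):
    x = rng.normal(size=M)
    v = alpha * v + W @ x              # forward step: v_t = alpha * v_{t-1} + W x_t
    # Full forward AD would contract the (N, N) Jacobian dv_t/dv_{t-1} with an
    # (N, N, M) sensitivity tensor; since that Jacobian is diag(alpha), the
    # contraction reduces to an element-wise scaling of an (N, M) trace:
    elig = alpha * elig + x[None, :]
    err = v - target                   # dL_t/dv_t for L_t = 0.5 * ||v_t - target||^2
    grad += err[:, None] * elig        # online gradient, no stored history of states

print(grad.shape)  # (4, 3): trace memory scales with the weights, not with T
```

In this toy setting the trace has the same shape as the weight matrix and is updated purely element-wise, so memory grows with network size but not with the sequence length, which mirrors the scaling claim in the abstract. With recurrent weights the state Jacobian would no longer be diagonal; e-prop-style rules keep only its diagonal, which is exactly the sparsity structure the paper's AD pipeline exploits.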
Related papers
- Deep learning for pedestrians: backpropagation in Transformers [1.14219428942199]
We apply our index-free methodology to new types of layers such as embedding, multi-headed self-attention and layer normalization. A complete PyTorch implementation of a minimalistic GPT-like network is also provided, along with analytical expressions for all of its gradient updates.
arXiv Detail & Related papers (2025-12-29T09:26:19Z) - Gradient Descent with Provably Tuned Learning-rate Schedules [14.391648046717073]
We develop novel analytical tools for provably tuning factors in gradient-based algorithms. Our analysis applies to neural networks with commonly used activation functions.
arXiv Detail & Related papers (2025-12-04T18:49:58Z) - Preserving Plasticity in Continual Learning with Adaptive Linearity Injection [10.641213440191551]
Loss of plasticity in deep neural networks is the gradual reduction in a model's capacity to incrementally learn. Recent work has shown that deep linear networks tend to be resilient towards loss of plasticity. We propose Adaptive Linearization (AdaLin), a general approach that dynamically adapts each neuron's activation function to mitigate plasticity loss.
arXiv Detail & Related papers (2025-05-14T15:36:51Z) - To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions [6.653325043862049]
We study gradient clipping in a least squares problem under streaming SGD.
We show that with Gaussian noise, clipping cannot improve SGD performance.
We propose a simple heuristic for near-optimal scheduling of the clipping threshold.
arXiv Detail & Related papers (2024-06-17T16:50:22Z) - How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z) - Nonsmooth automatic differentiation: a cheap gradient principle and other complexity results [0.0]
We provide a model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs.
Prominent examples are the famous relu and convolutional neural networks together with their standard loss functions.
arXiv Detail & Related papers (2022-06-01T08:43:35Z) - Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z) - Progressive Encoding for Neural Optimization [92.55503085245304]
We show the competence of the PPE layer for mesh transfer and its advantages compared to contemporary surface mapping techniques.
Most importantly, our technique is a parameterization-free method, and thus applicable to a variety of target shape representations.
arXiv Detail & Related papers (2021-04-19T08:22:55Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network.
We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
arXiv Detail & Related papers (2020-10-27T17:55:16Z) - Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z) - Randomized Automatic Differentiation [22.95414996614006]
We develop a general framework and approach for randomized automatic differentiation (RAD).
RAD allows unbiased gradient estimates to be computed with reduced memory at the cost of increased variance.
We show that RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks.
arXiv Detail & Related papers (2020-07-20T19:03:44Z)