Learnability Window in Gated Recurrent Neural Networks
- URL: http://arxiv.org/abs/2512.05790v1
- Date: Fri, 05 Dec 2025 15:16:59 GMT
- Title: Learnability Window in Gated Recurrent Neural Networks
- Authors: Lorenzo Livi
- Abstract summary: Gating mechanisms determine the learnability window $\mathcal{H}_N$ of recurrent neural networks. We show that learnability is governed by the \emph{effective learning rates} $\mu_{t,\ell}$, per-lag and per-neuron quantities.
- Score: 3.924071936547547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a theoretical framework that explains how gating mechanisms determine the learnability window $\mathcal{H}_N$ of recurrent neural networks, defined as the largest temporal horizon over which gradient information remains statistically recoverable. While classical analyses emphasize numerical stability of Jacobian products, we show that stability alone is insufficient: learnability is governed instead by the \emph{effective learning rates} $\mu_{t,\ell}$, per-lag and per-neuron quantities obtained from first-order expansions of gate-induced Jacobian products in Backpropagation Through Time. These effective learning rates act as multiplicative filters that control both the magnitude and anisotropy of gradient transport. Under heavy-tailed ($\alpha$-stable) gradient noise, we prove that the minimal sample size required to detect a dependency at lag~$\ell$ satisfies $N(\ell)\propto f(\ell)^{-\alpha}$, where $f(\ell)=\|\mu_{t,\ell}\|_1$ is the effective learning rate envelope. This leads to an explicit formula for $\mathcal{H}_N$ and closed-form scaling laws for logarithmic, polynomial, and exponential decay of $f(\ell)$. The theory predicts that broader or more heterogeneous gate spectra produce slower decay of $f(\ell)$ and hence larger learnability windows, whereas heavier-tailed noise compresses $\mathcal{H}_N$ by slowing statistical concentration. By linking gate-induced time-scale structure, gradient noise, and sample complexity, the framework identifies the effective learning rates as the fundamental quantities that govern when -- and for how long -- gated recurrent networks can learn long-range temporal dependencies.
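To make the scaling law concrete, the relation $N(\ell)\propto f(\ell)^{-\alpha}$ can be inverted: a lag $\ell$ is detectable from $N$ samples exactly when the envelope clears the threshold $f(\ell)\ge (c/N)^{1/\alpha}$, and $\mathcal{H}_N$ is the largest such lag. A minimal numerical sketch of this inversion follows; the constant `c`, the particular decay rates, and the `max_lag` cap are illustrative assumptions, not values from the paper.

```python
import numpy as np

def learnability_window(f, N, alpha, c=1.0, max_lag=10_000):
    """Largest lag ell with c * f(ell)**(-alpha) <= N, i.e. the largest
    ell whose envelope value f(ell) still exceeds (c / N)**(1 / alpha).

    f     : envelope f(ell) = ||mu_{t,ell}||_1, assumed decreasing in ell
    N     : available sample size
    alpha : stability index of the heavy-tailed gradient noise (0 < alpha <= 2)
    c     : problem-dependent constant hidden in N(ell) ~ f(ell)**(-alpha)
    """
    threshold = (c / N) ** (1.0 / alpha)
    lags = np.arange(1, max_lag + 1)
    detectable = f(lags) >= threshold
    return int(lags[detectable][-1]) if detectable.any() else 0

# The three decay regimes with closed-form scaling laws (rates are arbitrary):
envelopes = {
    "logarithmic": lambda ell: 1.0 / np.log(ell + np.e),
    "polynomial":  lambda ell: ell ** -1.5,
    "exponential": lambda ell: np.exp(-0.1 * ell),
}
for name, f in envelopes.items():
    print(name, [learnability_window(f, N, alpha=1.5) for N in (1e2, 1e4, 1e6)])
```

Under exponential decay the window grows only logarithmically in $N$, while slower (logarithmic or polynomial) decay of $f(\ell)$ yields much larger windows, consistent with the abstract's claim that slower envelope decay extends $\mathcal{H}_N$.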
Related papers
- Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration [67.12978375116599]
We show that the steps of gradient descent (GD) reduce to those of generalized perceptron algorithms. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks.
arXiv Detail & Related papers (2025-12-12T14:16:35Z) - Learning Intersections of Two Margin Halfspaces under Factorizable Distributions [56.51474048985742]
Learning intersections of halfspaces is a central problem in Computational Learning Theory. Even for just two halfspaces, it remains a major open question whether learning is possible in polynomial time. We introduce a novel algorithm that provably circumvents the CSQ hardness barrier.
arXiv Detail & Related papers (2025-11-13T00:28:24Z) - FFT-Accelerated Auxiliary Variable MCMC for Fermionic Lattice Models: A Determinant-Free Approach with $O(N\log N)$ Complexity [52.3171766248012]
We introduce a Markov Chain Monte Carlo (MCMC) algorithm that dramatically accelerates the simulation of quantum many-body systems. We validate our algorithm on benchmark quantum physics problems, accurately reproducing known theoretical results. Our work provides a powerful tool for large-scale probabilistic inference and opens avenues for physics-inspired generative models.
arXiv Detail & Related papers (2025-10-13T07:57:21Z) - Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization [29.174036532175855]
The learning rate in gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule.
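For reference, a minimal sketch of the cosine schedule mentioned above (a standard formula, not code from the paper):

```python
import math

def cosine_lr(step, total_steps, lr_max):
    """Cosine annealing: decays the learning rate from lr_max to zero."""
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * step / total_steps))

# Learning rate at the start, midpoint, and end of a 1000-step run:
print([round(cosine_lr(s, 1000, 0.1), 4) for s in (0, 500, 1000)])  # [0.1, 0.05, 0.0]
```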
arXiv Detail & Related papers (2025-03-12T14:06:34Z) - Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size [7.2620484413601325]
Using an increasing batch size leads to faster convergence than using a constant batch size. An increasing batch size was found to reduce the complexity of RSGD.
arXiv Detail & Related papers (2025-01-30T06:23:28Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions learned by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis [16.893624100273108]
We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices trained with gradient descent in the supervised learning setting.
We prove that gradient descent can achieve optimality \emph{without} massive over-parameterization.
Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks.
arXiv Detail & Related papers (2024-02-19T15:56:43Z) - Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time [45.72323731094864]
In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations.
Our study sheds new light on understanding why local methods work well.
arXiv Detail & Related papers (2024-02-06T01:29:35Z) - Hidden Progress in Deep Learning: SGD Learns Parities Near the
Computational Limit [36.17720004582283]
This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
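As a concrete illustration of the task (a sketch, not the paper's setup): over $\pm 1$ inputs, a $k$-sparse parity is just the product of $k$ fixed coordinates; the values of `n` and `k` and the choice of the first $k$ coordinates are arbitrary here.

```python
import numpy as np

def sparse_parity_batch(n, k, batch_size, rng):
    """Sample (x, y) where y is the parity of a fixed k-subset of n ±1 bits."""
    x = rng.choice([-1.0, 1.0], size=(batch_size, n))
    y = np.prod(x[:, :k], axis=1)  # parity = product of the k relevant bits
    return x, y

x, y = sparse_parity_batch(n=50, k=3, batch_size=8, rng=np.random.default_rng(0))
```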
arXiv Detail & Related papers (2022-07-18T17:55:05Z) - Lassoed Tree Boosting [53.56229983630983]
We prove that a gradient boosted tree algorithm with early stopping achieves faster-than-$n^{-1/4}$ $L^2$ convergence in the large nonparametric space of càdlàg functions of bounded sectional variation.
Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
arXiv Detail & Related papers (2022-05-22T00:34:41Z) - Improved rates for prediction and identification of partially observed
linear dynamical systems [4.68299658663016]
Identification of a linear time-invariant dynamical system from partial observations is a fundamental problem in control theory.
We propose an algorithm that learns such systems with non-asymptotic statistical rates depending on the inherent dimensionality $d$ of the system.
Our algorithm is based on multi-scale low-rank approximation: SVD applied to Hankel matrices of increasing sizes.
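A minimal sketch of that core step at a single scale (the window size `L`, target rank `d`, and the toy system below are illustrative assumptions, not the paper's full multi-scale procedure):

```python
import numpy as np

def hankel_lowrank(y, L, d):
    """Best rank-d approximation (via SVD) of the L x (len(y) - L + 1)
    Hankel matrix built from a scalar output sequence y."""
    T = len(y) - L + 1
    H = np.stack([y[i:i + T] for i in range(L)])  # H[i, j] = y[i + j]
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d]            # truncate to rank d

# Toy data: noisy output of a one-dimensional stable linear system.
rng = np.random.default_rng(0)
y = np.empty(200)
y[0] = 1.0
for t in range(1, 200):
    y[t] = 0.9 * y[t - 1] + 0.01 * rng.standard_normal()

H_denoised = hankel_lowrank(y, L=20, d=1)  # rank 1 matches the true order
```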
arXiv Detail & Related papers (2020-11-19T18:04:18Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Large-time asymptotics in deep learning [0.0]
We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training.
For the classical $L^2$-regularized empirical risk minimization problem, we show that the training error is at most of the order $\mathcal{O}\left(\frac{1}{T}\right)$.
In the setting of $\ell^p$-distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}\left(e^{-\mu T}\right)$.
arXiv Detail & Related papers (2020-08-06T07:33:17Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad non-neural class of functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)