Implicit Bias of Linear RNNs
- URL: http://arxiv.org/abs/2101.07833v1
- Date: Tue, 19 Jan 2021 19:39:28 GMT
- Title: Implicit Bias of Linear RNNs
- Authors: Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep
Rangan, Alyson K. Fletcher
- Abstract summary: Linear recurrent neural networks (RNNs) do not perform well on tasks requiring long-term memory.
This paper provides a rigorous explanation of this property in the special case of linear RNNs.
Using recently-developed kernel regime analysis, our main result shows that linear RNNs are functionally equivalent to a certain weighted 1D-convolutional network.
- Score: 27.41989861342218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary wisdom based on empirical studies suggests that standard
recurrent neural networks (RNNs) do not perform well on tasks requiring
long-term memory. However, the precise reason for this behavior is still
unknown. This paper provides a rigorous explanation of this property in the
special case of linear RNNs. Although this work is limited to linear RNNs, even
these systems have traditionally been difficult to analyze due to their
non-linear parameterization. Using recently-developed kernel regime analysis,
our main result shows that linear RNNs learned from random initializations are
functionally equivalent to a certain weighted 1D-convolutional network.
Importantly, the weightings in the equivalent model cause an implicit bias
toward elements with smaller time lags in the convolution and, hence, toward
shorter memory.
The degree of this bias depends on the variance of the transition kernel matrix
at initialization and is related to the classic exploding and vanishing
gradients problem. The theory is validated in both synthetic and real data
experiments.
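To make the stated equivalence concrete, the NumPy sketch below (illustrative only; the dimensions, scales, and variable names are assumptions, not the authors' code) computes the impulse response g[k] = c^T W^k b of a linear RNN h_t = W h_{t-1} + b x_t, y_t = c^T h_t, checks that the RNN output coincides with the causal 1D convolution of the input with g, and shows how the kernel decays with lag depending on the initialization variance of W.

```python
# Minimal sketch (not the authors' code): a linear RNN as a weighted 1D conv.
import numpy as np

rng = np.random.default_rng(0)
n, T, sigma = 64, 40, 0.9                 # state size, sequence length, init scale
W = rng.normal(0.0, sigma / np.sqrt(n), size=(n, n))   # entries ~ N(0, sigma^2/n)
b = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)
c = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)

# Impulse response g[k] = c^T W^k b is the RNN's equivalent conv kernel.
g, v = np.empty(T), b.copy()
for k in range(T):
    g[k] = c @ v
    v = W @ v

# Run the RNN h_t = W h_{t-1} + b x_t, y_t = c^T h_t on a random input ...
x = rng.normal(size=T)
h, y_rnn = np.zeros(n), np.empty(T)
for t in range(T):
    h = W @ h + b * x[t]
    y_rnn[t] = c @ h

# ... and check it coincides with the causal convolution of x with g.
y_conv = np.convolve(x, g)[:T]
print("max |RNN - conv|:", np.abs(y_rnn - y_conv).max())        # ~1e-16

# Smaller |g[k]| at large lags k = the implicit bias toward short memory.
print("|g[k]| at lags 0, 10, 20, 30:", np.abs(g[[0, 10, 20, 30]]).round(6))
```

With i.i.d. N(0, sigma^2/n) entries, the spectral radius of W is approximately sigma, so |g[k]| shrinks on the order of sigma^k; a smaller initialization variance therefore weights the equivalent convolution more heavily toward small lags, i.e., toward shorter memory.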
Related papers
- Matrix Completion via Nonsmooth Regularization of Fully Connected Neural Networks [7.349727826230864]
It has been shown that enhanced performance can be attained by using nonlinear estimators such as deep neural networks.
In this paper, we control over-fitting by regularizing the FCNN model in terms of the norm of its intermediate representations.
Our simulations indicate the superiority of the proposed algorithm over existing linear and nonlinear algorithms.
arXiv Detail & Related papers (2024-03-15T12:00:37Z)
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key to the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Inverse Approximation Theory for Nonlinear Recurrent Neural Networks [28.840757822712195]
We prove an inverse approximation theorem for the approximation of nonlinear sequence-to-sequence relationships using recurrent neural networks (RNNs).
We show that nonlinear sequence relationships that can be stably approximated by nonlinear RNNs must have an exponentially decaying memory structure.
This extends the previously identified curse of memory in linear RNNs into the general nonlinear setting.
arXiv Detail & Related papers (2023-05-30T16:34:28Z)
- Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
During training, DNNs suffer from either infinite or zero gradients because the discrete quantization function is non-differentiable.
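This gradient pathology is easy to reproduce. The PyTorch sketch below is a generic illustration of the problem and of the standard straight-through workaround; it is not this paper's Ricci-flow method.

```python
# Generic illustration (not this paper's Ricci-flow method): a discrete
# activation such as sign(x) back-propagates exactly zero gradient.
import torch

x = torch.randn(4, requires_grad=True)

hard = torch.sign(x)                 # discrete forward pass
hard.sum().backward()
print(x.grad)                        # tensor of zeros: no learning signal

x.grad = None
# Straight-through estimator: discrete values forward, identity backward.
ste = x + (torch.sign(x) - x).detach()
ste.sum().backward()
print(x.grad)                        # tensor of ones: gradient flows as if f(x) = x
```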
arXiv Detail & Related papers (2023-02-07T10:51:53Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- Discovering Invariant Rationales for Graph Neural Networks [104.61908788639052]
Intrinsic interpretability of graph neural networks (GNNs) amounts to finding a small subset of the input graph's features that guides the prediction.
We propose a new strategy of discovering invariant rationale (DIR) to construct intrinsically interpretable GNNs.
arXiv Detail & Related papers (2022-01-30T16:43:40Z)
- Fast Axiomatic Attribution for Neural Networks [44.527672563424545]
Recent approaches incorporate priors on the feature attribution of a deep neural network (DNN) into the training process to reduce the dependence on unwanted features.
We consider a special class of efficiently axiomatically attributable DNNs for which an axiomatic feature attribution can be computed with only a single forward/backward pass.
Various experiments demonstrate the advantages of $\mathcal{X}$-DNNs, beating state-of-the-art generic attribution methods on regular DNNs for training with attribution priors.
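As a rough illustration of a single forward/backward-pass attribution, the PyTorch sketch below uses input-times-gradient on a bias-free (hence nonnegatively homogeneous) toy network; the model and sizes are assumptions, not the paper's architecture.

```python
# Sketch (assumed toy model, not the authors' code): input-times-gradient
# attribution from one forward and one backward pass. For bias-free ReLU
# networks (nonnegatively homogeneous), the per-feature scores sum to the output.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(8, 16, bias=False),   # no biases => homogeneous in the input
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1, bias=False),
)

x = torch.randn(1, 8, requires_grad=True)
y = model(x)
y.backward()                              # one backward pass
attribution = x * x.grad                  # per-feature contribution
print(attribution.sum().item(), y.item()) # equal, by Euler's homogeneity theorem
```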
arXiv Detail & Related papers (2021-11-15T10:51:01Z)
- How to train RNNs on chaotic data? [7.276372008305615]
Recurrent neural networks (RNNs) are widespread machine learning tools for modeling sequential and time series data.
They are notoriously hard to train because their loss gradients, backpropagated through time, tend to saturate or diverge during training.
Here we offer a comprehensive theoretical treatment of this problem by relating the loss gradients during RNN training to the Lyapunov spectrum of RNN-generated orbits.
arXiv Detail & Related papers (2021-10-14T09:07:42Z)
- Improving predictions of Bayesian neural nets via local linearization [79.21517734364093]
We argue that the Gauss-Newton approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN).
Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one.
We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation (a minimal sketch follows below).
arXiv Detail & Related papers (2020-08-19T12:35:55Z)
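To spell out what predicting with the linearized model involves, here is a minimal PyTorch 2.x sketch (assumed notation and toy model, not the authors' code) of the first-order expansion f_lin(x, theta) = f(x, theta*) + J_theta f(x, theta*) (theta - theta*), through which a perturbed weight sample is pushed.

```python
# Sketch (assumed notation, toy model, PyTorch >= 2.0 for torch.func;
# not the authors' code): predict through the weight-space linearization
# f_lin(x, theta) = f(x, theta*) + J_theta f(x, theta*) (theta - theta*).
import torch

model = torch.nn.Sequential(torch.nn.Linear(2, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
theta_star = {k: v.detach().clone() for k, v in model.state_dict().items()}  # MAP weights

def f(params, x):
    return torch.func.functional_call(model, params, (x,))

def f_lin(params, x):
    delta = {k: params[k] - theta_star[k] for k in params}
    out, jvp_out = torch.func.jvp(lambda p: f(p, x), (theta_star,), (delta,))
    return out + jvp_out                 # first-order Taylor expansion in the weights

x = torch.randn(1, 2)
# A weight sample (e.g. from a Laplace posterior around theta*) is pushed
# through the linearized model rather than the original network.
theta_sample = {k: v + 0.1 * torch.randn_like(v) for k, v in theta_star.items()}
print(f(theta_sample, x), f_lin(theta_sample, x))  # close, but f_lin is the GLM predictive
```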