Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
- URL: http://arxiv.org/abs/2307.11888v3
- Date: Wed, 5 Jun 2024 10:00:40 GMT
- Title: Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
- Authors: Antonio Orvieto, Soham De, Caglar Gulcehre, Razvan Pascanu, Samuel L. Smith
- Abstract summary: We show that combining MLPs with either real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps.
We prove that employing complex eigenvalues near the unit disk - i.e., empirically the most successful strategy in S4 - greatly helps the RNN store information.
- Score: 32.783917920167205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks based on linear RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches for sequence modeling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures' effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection to specific choices crucial in practice - e.g., carefully designed initialization distribution and potential use of complex numbers. In this paper, we show that combining MLPs with either real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that real diagonal linear recurrences are enough to achieve universality in this architecture, we prove that employing complex eigenvalues near the unit disk - i.e., empirically the most successful strategy in S4 - greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experiments supporting our claims.
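As a rough illustration of the architecture and of this separation of concerns, here is a minimal NumPy sketch (not the authors' implementation): a complex diagonal linear recurrence whose eigenvalue magnitudes are initialized near the unit disk, in the spirit of the LRU, followed by a position-wise MLP that reads the real and imaginary parts of the state. All shapes, initialization ranges, and the random output head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_diagonal_recurrence(n_state, r_min=0.9, r_max=0.999):
    """Complex eigenvalues lambda = r * exp(i*theta) with |lambda| close to 1,
    so the state only slowly forgets past inputs."""
    r = rng.uniform(r_min, r_max, n_state)        # magnitudes near the unit circle
    theta = rng.uniform(0.0, 2 * np.pi, n_state)  # phases
    return r * np.exp(1j * theta)                 # diagonal of the recurrence matrix

def linear_rnn(u, lam, B):
    """x_k = diag(lam) x_{k-1} + B u_k : the linear encoding of the input sequence."""
    x = np.zeros(lam.shape[0], dtype=complex)
    states = []
    for u_k in u:                                 # u has shape (T, d_in)
        x = lam * x + B @ u_k
        states.append(x)
    return np.stack(states)                       # (T, n_state), complex

def position_wise_mlp(h, W1, b1, W2, b2):
    """Non-linear read-out applied independently at every time step."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2

# Toy dimensions, purely illustrative.
T, d_in, n_state, d_hidden, d_out = 32, 4, 16, 64, 4
u = rng.standard_normal((T, d_in))
lam = init_diagonal_recurrence(n_state)
B = (rng.standard_normal((n_state, d_in))
     + 1j * rng.standard_normal((n_state, d_in))) / np.sqrt(2 * d_in)

x = linear_rnn(u, lam, B)                             # linear encoding of the sequence
h = np.concatenate([x.real, x.imag], axis=-1)         # (T, 2*n_state) real features
W1 = rng.standard_normal((2 * n_state, d_hidden)) / np.sqrt(2 * n_state)
W2 = rng.standard_normal((d_hidden, d_out)) / np.sqrt(d_hidden)
y = position_wise_mlp(h, W1, np.zeros(d_hidden), W2, np.zeros(d_out))  # (T, d_out)
```

The recurrence is unrolled with a plain Python loop for clarity; in practice these models compute the diagonal recurrence with parallel scans or convolutions, which is one source of the computational efficiency mentioned above.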
Related papers
- RandONet: Shallow-Networks with Random Projections for learning linear and nonlinear operators [0.0]
We present Random Projection-based Operator Networks (RandONets).
RandONets are shallow networks with random projections that learn linear and nonlinear operators.
We show that, for this particular task, RandONets outperform the "vanilla" DeepONets in terms of both numerical approximation accuracy and computational cost.
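Here is a minimal NumPy sketch of the random-projection idea behind such operator networks (an illustration, not the authors' RandONet implementation): a DeepONet-style branch/trunk pair built from frozen random tanh features, with only the linear readout trained by ridge-regularized least squares, demonstrated on a toy antiderivative operator. The operator, shapes, and feature maps are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy operator-learning task (an assumption for this example): learn the antiderivative
# operator G(u)(y) = integral_0^y u(t) dt from input functions u sampled on a fixed grid.
m, n_funcs, n_query = 50, 200, 10
grid = np.linspace(0.0, 1.0, m)
dt = grid[1] - grid[0]

a = rng.uniform(-1, 1, n_funcs)
f = rng.integers(1, 4, n_funcs).astype(float)
b = rng.uniform(-1, 1, n_funcs)
U = a[:, None] * np.sin(2 * np.pi * f[:, None] * grid) + b[:, None]        # (n_funcs, m)
antideriv = np.concatenate(
    [np.zeros((n_funcs, 1)),
     np.cumsum((U[:, 1:] + U[:, :-1]) / 2, axis=1) * dt], axis=1)           # trapezoid rule

Yq = rng.uniform(0, 1, (n_funcs, n_query))                                   # query points y
target = np.stack([np.interp(Yq[i], grid, antideriv[i]) for i in range(n_funcs)])

# Shallow random-projection model: frozen random "branch" features of the sampled input
# and "trunk" features of the query location; only the linear readout is trained,
# here by ridge-regularized least squares.
p = 200
Wb = rng.standard_normal((m, p)) / np.sqrt(m); bb = rng.uniform(-1, 1, p)
Wt = rng.standard_normal((1, p));              bt = rng.uniform(-1, 1, p)

branch = np.tanh(U @ Wb + bb)                        # (n_funcs, p)
trunk = np.tanh(Yq[..., None] @ Wt + bt)             # (n_funcs, n_query, p)
features = branch[:, None, :] * trunk                # DeepONet-style product features

X = features.reshape(-1, p)
y = target.reshape(-1)
readout = np.linalg.solve(X.T @ X + 1e-6 * np.eye(p), X.T @ y)   # trained linear layer

pred = X @ readout
print("relative L2 error:", np.linalg.norm(pred - y) / np.linalg.norm(y))
```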
arXiv Detail & Related papers (2024-06-08T13:20:48Z) - Universal In-Context Approximation By Prompting Fully Recurrent Models [86.61942787684272]
We show that RNNs, LSTMs, GRUs, Linear RNNs, and linear gated architectures can serve as universal in-context approximators.
We introduce a programming language called LSRL that compiles to fully recurrent architectures.
arXiv Detail & Related papers (2024-06-03T15:25:13Z) - The Convex Landscape of Neural Networks: Characterizing Global Optima
and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) training poses a highly non-convex optimization problem.
In this paper we examine the use of convex neural recovery models.
We show that all stationary points of the non-convex training objective can be characterized as global optima of subsampled convex (Lasso-type) programs.
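For a flavor of such Lasso-type convex reformulations, here is a minimal NumPy sketch of a simplified variant (the "gated ReLU" relaxation with randomly subsampled activation patterns), solved with plain proximal gradient descent. The data, the number of sampled patterns, and the solver are illustrative assumptions rather than the paper's exact program.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data, purely illustrative.
n, d, P = 100, 5, 20                        # samples, features, sampled activation patterns
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Subsampled activation patterns: D_p = diag(1[X u_p >= 0]) for random gate vectors u_p.
U = rng.standard_normal((d, P))
D = (X @ U >= 0).astype(float)              # (n, P) binary masks

# Convex group-lasso objective over per-pattern weights w_p:
#   min_W  0.5 * || sum_p D_p X w_p - y ||^2  +  lam * sum_p ||w_p||_2
A = (D[:, :, None] * X[:, None, :]).reshape(n, P * d)   # stacked design matrix
L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of the smooth part
lam, W = 0.1, np.zeros((P, d))

for _ in range(500):                        # proximal gradient (ISTA)
    r = A @ W.reshape(-1) - y
    G = (A.T @ r).reshape(P, d)             # gradient of the smooth part
    V = W - G / L
    norms = np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    W = np.maximum(1.0 - lam / (L * norms), 0.0) * V    # block soft-thresholding

pred = A @ W.reshape(-1)
active = int((np.linalg.norm(W, axis=1) > 1e-8).sum())
print("train MSE:", np.mean((pred - y) ** 2), "| active patterns:", active, "/", P)
```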
arXiv Detail & Related papers (2023-12-19T23:04:56Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Assessing the Unitary RNN as an End-to-End Compositional Model of Syntax [0.0]
We show that both an LSTM and a unitary-evolution recurrent neural network (URN) can achieve encouraging accuracy on two types of syntactic patterns.
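As a minimal illustration of unitary evolution (an assumed parameterization, not necessarily the URN paper's exact scheme), the NumPy sketch below builds a recurrence matrix as the exponential of a skew-Hermitian generator, which makes it exactly unitary, and checks that the hidden-state norm is preserved over many steps.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Parameterize the recurrence matrix as the exponential of a skew-Hermitian generator,
# which makes it exactly unitary (assumed scheme for illustration).
n = 16
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = A - A.conj().T                      # skew-Hermitian: S^H = -S
U = expm(S)                             # exp of a skew-Hermitian matrix is unitary

print("||U^H U - I|| =", np.linalg.norm(U.conj().T @ U - np.eye(n)))

# Unitary evolution preserves the hidden-state norm, so information is neither
# amplified nor washed out over time (no exploding or vanishing state).
h = rng.standard_normal(n) + 1j * rng.standard_normal(n)
norms = []
for _ in range(100):
    h = U @ h
    norms.append(np.linalg.norm(h))
print("state norm over 100 steps: min %.6f, max %.6f" % (min(norms), max(norms)))
```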
arXiv Detail & Related papers (2022-08-11T09:30:49Z) - Learning Generative Prior with Latent Space Sparsity Constraints [25.213673771175692]
It has been argued that the distribution of natural images does not lie on a single manifold but rather on a union of several submanifolds.
We propose a sparsity-driven latent space sampling (SDLSS) framework and develop a proximal meta-learning (PML) algorithm to enforce sparsity in the latent space.
The results demonstrate that for a higher degree of compression, the SDLSS method is more efficient than the state-of-the-art method.
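The following NumPy sketch illustrates the general idea of enforcing sparsity on a latent code with proximal (soft-thresholding) updates, using a toy fixed generator and compressed linear measurements; it is not the paper's SDLSS/PML algorithm, and all shapes and step sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption, not the paper's networks): a fixed "generator" G(z) = tanh(Wg @ z)
# maps a latent code to a signal, and we observe compressed measurements y = A @ x_true.
d_latent, d_signal, n_meas = 64, 256, 40
Wg = rng.standard_normal((d_signal, d_latent)) / np.sqrt(d_latent)
A = rng.standard_normal((n_meas, d_signal)) / np.sqrt(n_meas)

z_true = np.zeros(d_latent)
z_true[rng.choice(d_latent, 5, replace=False)] = rng.standard_normal(5)   # sparse latent code
x_true = np.tanh(Wg @ z_true)
y = A @ x_true

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrinks entries toward zero, enforcing sparsity."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Proximal gradient descent on the latent code: a gradient step on the measurement loss
# 0.5 * ||A G(z) - y||^2, followed by soft-thresholding to keep z sparse.
z, lr, lam = np.zeros(d_latent), 5e-3, 1e-3
for _ in range(4000):
    x = np.tanh(Wg @ z)
    residual = A @ x - y
    grad = Wg.T @ ((1.0 - x ** 2) * (A.T @ residual))   # chain rule through tanh
    z = soft_threshold(z - lr * grad, lr * lam)

print("non-zeros in recovered z:", np.count_nonzero(z), "/", d_latent)
print("relative signal error:",
      np.linalg.norm(np.tanh(Wg @ z) - x_true) / np.linalg.norm(x_true))
```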
arXiv Detail & Related papers (2021-05-25T14:12:04Z) - Rank-R FNN: A Tensor-Based Learning Model for High-Order Data
Classification [69.26747803963907]
The Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes a Canonical/Polyadic (CP) decomposition on its parameters.
It handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
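A minimal NumPy sketch of a CP-constrained hidden layer in this spirit (rank, shapes, and initialization are illustrative assumptions, not the paper's configuration): each hidden unit's weight tensor is a rank-R sum of outer products, so its pre-activation is a multilinear contraction of the input tensor rather than a dot product with a vectorized input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a 3rd-order input tensor, e.g. a small hyperspectral patch
# of size I x J x K, classified without vectorization.
I, J, K = 5, 5, 8              # spatial x spatial x spectral
R, H, n_classes = 3, 16, 4     # CP rank, hidden units, classes

# Each hidden unit h owns a rank-R CP weight tensor W_h = sum_r a_hr (x) b_hr (x) c_hr.
A = rng.standard_normal((H, R, I)) * 0.1
B = rng.standard_normal((H, R, J)) * 0.1
C = rng.standard_normal((H, R, K)) * 0.1
bias = np.zeros(H)
W_out = rng.standard_normal((H, n_classes)) * 0.1

def rank_r_fnn(X):
    """Forward pass: <W_h, X> = sum_r X x_1 a_hr x_2 b_hr x_3 c_hr, then sigmoid and a linear head."""
    pre = np.einsum('ijk,hri,hrj,hrk->h', X, A, B, C) + bias   # multilinear contraction per hidden unit
    hidden = 1.0 / (1.0 + np.exp(-pre))                        # sigmoid activation
    return hidden @ W_out                                      # class scores

X = rng.standard_normal((I, J, K))
print(rank_r_fnn(X))   # one score per class; parameters scale with R*(I+J+K) instead of I*J*K
```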
arXiv Detail & Related papers (2021-04-11T16:37:32Z) - How Neural Networks Extrapolate: From Feedforward to Graph Neural
Networks [80.55378250013496]
We study how neural networks trained by gradient descent extrapolate what they learn outside the support of the training distribution.
Graph Neural Networks (GNNs) have shown some success in more complex tasks.
arXiv Detail & Related papers (2020-09-24T17:48:59Z) - Provably Efficient Neural Estimation of Structural Equation Model: An
Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using gradient descent.
For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
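Here is a minimal PyTorch sketch of the general adversarial conditional-moment idea on a toy instrumental-variable problem: a structural function f and an adversarial test function g, both small networks, trained by alternating gradient steps on a min-max objective. The data-generating process, architectures, and objective details are assumptions for illustration, not the paper's estimator.

```python
import torch

torch.manual_seed(0)

# Toy instrumental-variable data (an assumption for illustration): Z is an instrument,
# X is endogenous because of an unobserved confounder, and E[Y - f0(X) | Z] = 0
# holds for the true structural function f0(x) = x**2.
n = 2000
Z = torch.randn(n, 1)
confounder = torch.randn(n, 1)
X = Z + 0.5 * confounder + 0.1 * torch.randn(n, 1)
Y = X ** 2 + confounder + 0.1 * torch.randn(n, 1)

f = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))  # structural function
g = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))  # adversarial test function

opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)

for step in range(3000):
    # Ascent step on the adversary: g tries to detect any violation of the moment condition.
    gz = g(Z)
    game_g = ((Y - f(X)).detach() * gz).mean() - 0.5 * (gz ** 2).mean()
    opt_g.zero_grad(); (-game_g).backward(); opt_g.step()

    # Descent step on the structural function: f tries to make the residual uncorrelated with g(Z).
    gz = g(Z).detach()
    game_f = ((Y - f(X)) * gz).mean()
    opt_f.zero_grad(); game_f.backward(); opt_f.step()

with torch.no_grad():
    xs = torch.linspace(-2, 2, 5).reshape(-1, 1)
    print(torch.cat([xs, f(xs), xs ** 2], dim=1))   # columns: x, learned f(x), true f0(x)
```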
arXiv Detail & Related papers (2020-07-02T17:55:47Z)