Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
- URL: http://arxiv.org/abs/2307.11888v3
- Date: Wed, 5 Jun 2024 10:00:40 GMT
- Title: Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
- Authors: Antonio Orvieto, Soham De, Caglar Gulcehre, Razvan Pascanu, Samuel L. Smith
- Abstract summary: We show that combining MLPs with either real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps.
We prove that employing complex eigenvalues near the unit disk - i.e., empirically the most successful strategy in S4 - greatly helps the RNN store information.
- Score: 32.783917920167205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks based on linear RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches for sequence modeling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures' effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection to specific choices crucial in practice - e.g., carefully designed initialization distribution and potential use of complex numbers. In this paper, we show that combining MLPs with either real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that real diagonal linear recurrences are enough to achieve universality in this architecture, we prove that employing complex eigenvalues near the unit disk - i.e., empirically the most successful strategy in S4 - greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experiments supporting our claims.
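As a rough illustration of the architecture and of this separation of concerns, here is a minimal NumPy sketch (not the authors' implementation): a complex diagonal linear recurrence whose eigenvalue magnitudes are initialized near the unit disk, in the spirit of the LRU, followed by a position-wise MLP that reads the real and imaginary parts of the state. All shapes, initialization ranges, and the random output head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_diagonal_recurrence(n_state, r_min=0.9, r_max=0.999):
    """Complex eigenvalues lambda = r * exp(i*theta) with |lambda| close to 1,
    so the state only slowly forgets past inputs."""
    r = rng.uniform(r_min, r_max, n_state)        # magnitudes near the unit circle
    theta = rng.uniform(0.0, 2 * np.pi, n_state)  # phases
    return r * np.exp(1j * theta)                 # diagonal of the recurrence matrix

def linear_rnn(u, lam, B):
    """x_k = diag(lam) x_{k-1} + B u_k : the linear encoding of the input sequence."""
    x = np.zeros(lam.shape[0], dtype=complex)
    states = []
    for u_k in u:                                 # u has shape (T, d_in)
        x = lam * x + B @ u_k
        states.append(x)
    return np.stack(states)                       # (T, n_state), complex

def position_wise_mlp(h, W1, b1, W2, b2):
    """Non-linear read-out applied independently at every time step."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2

# Toy dimensions, purely illustrative.
T, d_in, n_state, d_hidden, d_out = 32, 4, 16, 64, 4
u = rng.standard_normal((T, d_in))
lam = init_diagonal_recurrence(n_state)
B = (rng.standard_normal((n_state, d_in))
     + 1j * rng.standard_normal((n_state, d_in))) / np.sqrt(2 * d_in)

x = linear_rnn(u, lam, B)                             # linear encoding of the sequence
h = np.concatenate([x.real, x.imag], axis=-1)         # (T, 2*n_state) real features
W1 = rng.standard_normal((2 * n_state, d_hidden)) / np.sqrt(2 * n_state)
W2 = rng.standard_normal((d_hidden, d_out)) / np.sqrt(d_hidden)
y = position_wise_mlp(h, W1, np.zeros(d_hidden), W2, np.zeros(d_out))  # (T, d_out)
```

The recurrence is unrolled with a plain Python loop for clarity; in practice these models compute the diagonal recurrence with parallel scans or convolutions, which is one source of the computational efficiency mentioned above.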
Related papers
- RandONet: Shallow-Networks with Random Projections for learning linear and nonlinear operators [0.0]
We present Random Projection-based Operator Networks (RandONets).
RandONets are shallow networks with random projections that learn linear and nonlinear operators.
We show that, for this particular task, RandONets outperform the "vanilla" DeepONets in terms of both numerical approximation accuracy and computational cost.
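Here is a minimal NumPy sketch of the random-projection idea behind such operator networks (an illustration, not the authors' RandONet implementation): a DeepONet-style branch/trunk pair built from frozen random tanh features, with only the linear readout trained by ridge-regularized least squares, demonstrated on a toy antiderivative operator. The operator, shapes, and feature maps are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy operator-learning task (an assumption for this example): learn the antiderivative
# operator G(u)(y) = integral_0^y u(t) dt from input functions u sampled on a fixed grid.
m, n_funcs, n_query = 50, 200, 10
grid = np.linspace(0.0, 1.0, m)
dt = grid[1] - grid[0]

a = rng.uniform(-1, 1, n_funcs)
f = rng.integers(1, 4, n_funcs).astype(float)
b = rng.uniform(-1, 1, n_funcs)
U = a[:, None] * np.sin(2 * np.pi * f[:, None] * grid) + b[:, None]        # (n_funcs, m)
antideriv = np.concatenate(
    [np.zeros((n_funcs, 1)),
     np.cumsum((U[:, 1:] + U[:, :-1]) / 2, axis=1) * dt], axis=1)           # trapezoid rule

Yq = rng.uniform(0, 1, (n_funcs, n_query))                                   # query points y
target = np.stack([np.interp(Yq[i], grid, antideriv[i]) for i in range(n_funcs)])

# Shallow random-projection model: frozen random "branch" features of the sampled input
# and "trunk" features of the query location; only the linear readout is trained,
# here by ridge-regularized least squares.
p = 200
Wb = rng.standard_normal((m, p)) / np.sqrt(m); bb = rng.uniform(-1, 1, p)
Wt = rng.standard_normal((1, p));              bt = rng.uniform(-1, 1, p)

branch = np.tanh(U @ Wb + bb)                        # (n_funcs, p)
trunk = np.tanh(Yq[..., None] @ Wt + bt)             # (n_funcs, n_query, p)
features = branch[:, None, :] * trunk                # DeepONet-style product features

X = features.reshape(-1, p)
y = target.reshape(-1)
readout = np.linalg.solve(X.T @ X + 1e-6 * np.eye(p), X.T @ y)   # trained linear layer

pred = X @ readout
print("relative L2 error:", np.linalg.norm(pred - y) / np.linalg.norm(y))
```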
arXiv Detail & Related papers (2024-06-08T13:20:48Z) - Universal In-Context Approximation By Prompting Fully Recurrent Models [86.61942787684272]
We show that RNNs, LSTMs, GRUs, Linear RNNs, and linear gated architectures can serve as universal in-context approximators.
We introduce a programming language called LSRL that compiles to fully recurrent architectures.
arXiv Detail & Related papers (2024-06-03T15:25:13Z) - The Convex Landscape of Neural Networks: Characterizing Global Optima
and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) training poses a highly non-convex optimization problem.
In this paper we examine the use of convex neural recovery models.
We show that all stationary points of the non-convex training objective can be characterized as global optima of subsampled convex (Lasso-type) programs.
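For a flavor of such Lasso-type convex reformulations, here is a minimal NumPy sketch of a simplified variant (the "gated ReLU" relaxation with randomly subsampled activation patterns), solved with plain proximal gradient descent. The data, the number of sampled patterns, and the solver are illustrative assumptions rather than the paper's exact program.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data, purely illustrative.
n, d, P = 100, 5, 20                        # samples, features, sampled activation patterns
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Subsampled activation patterns: D_p = diag(1[X u_p >= 0]) for random gate vectors u_p.
U = rng.standard_normal((d, P))
D = (X @ U >= 0).astype(float)              # (n, P) binary masks

# Convex group-lasso objective over per-pattern weights w_p:
#   min_W  0.5 * || sum_p D_p X w_p - y ||^2  +  lam * sum_p ||w_p||_2
A = (D[:, :, None] * X[:, None, :]).reshape(n, P * d)   # stacked design matrix
L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of the smooth part
lam, W = 0.1, np.zeros((P, d))

for _ in range(500):                        # proximal gradient (ISTA)
    r = A @ W.reshape(-1) - y
    G = (A.T @ r).reshape(P, d)             # gradient of the smooth part
    V = W - G / L
    norms = np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    W = np.maximum(1.0 - lam / (L * norms), 0.0) * V    # block soft-thresholding

pred = A @ W.reshape(-1)
active = int((np.linalg.norm(W, axis=1) > 1e-8).sum())
print("train MSE:", np.mean((pred - y) ** 2), "| active patterns:", active, "/", P)
```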
arXiv Detail & Related papers (2023-12-19T23:04:56Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Assessing the Unitary RNN as an End-to-End Compositional Model of Syntax [0.0]
We show that both an LSTM and a unitary-evolution recurrent neural network (URN) can achieve encouraging accuracy on two types of syntactic patterns.
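As a minimal illustration of unitary evolution (an assumed parameterization, not necessarily the URN paper's exact scheme), the NumPy sketch below builds a recurrence matrix as the exponential of a skew-Hermitian generator, which makes it exactly unitary, and checks that the hidden-state norm is preserved over many steps.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Parameterize the recurrence matrix as the exponential of a skew-Hermitian generator,
# which makes it exactly unitary (assumed scheme for illustration).
n = 16
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = A - A.conj().T                      # skew-Hermitian: S^H = -S
U = expm(S)                             # exp of a skew-Hermitian matrix is unitary

print("||U^H U - I|| =", np.linalg.norm(U.conj().T @ U - np.eye(n)))

# Unitary evolution preserves the hidden-state norm, so information is neither
# amplified nor washed out over time (no exploding or vanishing state).
h = rng.standard_normal(n) + 1j * rng.standard_normal(n)
norms = []
for _ in range(100):
    h = U @ h
    norms.append(np.linalg.norm(h))
print("state norm over 100 steps: min %.6f, max %.6f" % (min(norms), max(norms)))
```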
arXiv Detail & Related papers (2022-08-11T09:30:49Z) - Learning Generative Prior with Latent Space Sparsity Constraints [25.213673771175692]
It has been argued that the distribution of natural images does not lie on a single manifold but rather on a union of several submanifolds.
We propose a sparsity-driven latent space sampling (SDLSS) framework and develop a proximal meta-learning (PML) algorithm to enforce sparsity in the latent space.
The results demonstrate that for a higher degree of compression, the SDLSS method is more efficient than the state-of-the-art method.
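The following NumPy sketch illustrates the general idea of enforcing sparsity on a latent code with proximal (soft-thresholding) updates, using a toy fixed generator and compressed linear measurements; it is not the paper's SDLSS/PML algorithm, and all shapes and step sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption, not the paper's networks): a fixed "generator" G(z) = tanh(Wg @ z)
# maps a latent code to a signal, and we observe compressed measurements y = A @ x_true.
d_latent, d_signal, n_meas = 64, 256, 40
Wg = rng.standard_normal((d_signal, d_latent)) / np.sqrt(d_latent)
A = rng.standard_normal((n_meas, d_signal)) / np.sqrt(n_meas)

z_true = np.zeros(d_latent)
z_true[rng.choice(d_latent, 5, replace=False)] = rng.standard_normal(5)   # sparse latent code
x_true = np.tanh(Wg @ z_true)
y = A @ x_true

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrinks entries toward zero, enforcing sparsity."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Proximal gradient descent on the latent code: a gradient step on the measurement loss
# 0.5 * ||A G(z) - y||^2, followed by soft-thresholding to keep z sparse.
z, lr, lam = np.zeros(d_latent), 5e-3, 1e-3
for _ in range(4000):
    x = np.tanh(Wg @ z)
    residual = A @ x - y
    grad = Wg.T @ ((1.0 - x ** 2) * (A.T @ residual))   # chain rule through tanh
    z = soft_threshold(z - lr * grad, lr * lam)

print("non-zeros in recovered z:", np.count_nonzero(z), "/", d_latent)
print("relative signal error:",
      np.linalg.norm(np.tanh(Wg @ z) - x_true) / np.linalg.norm(x_true))
```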
arXiv Detail & Related papers (2021-05-25T14:12:04Z) - Rank-R FNN: A Tensor-Based Learning Model for High-Order Data
Classification [69.26747803963907]
The Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes a Canonical/Polyadic (CP) decomposition on its parameters.
It handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
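A minimal NumPy sketch of a CP-constrained hidden layer in this spirit (rank, shapes, and initialization are illustrative assumptions, not the paper's configuration): each hidden unit's weight tensor is a rank-R sum of outer products, so its pre-activation is a multilinear contraction of the input tensor rather than a dot product with a vectorized input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a 3rd-order input tensor, e.g. a small hyperspectral patch
# of size I x J x K, classified without vectorization.
I, J, K = 5, 5, 8              # spatial x spatial x spectral
R, H, n_classes = 3, 16, 4     # CP rank, hidden units, classes

# Each hidden unit h owns a rank-R CP weight tensor W_h = sum_r a_hr (x) b_hr (x) c_hr.
A = rng.standard_normal((H, R, I)) * 0.1
B = rng.standard_normal((H, R, J)) * 0.1
C = rng.standard_normal((H, R, K)) * 0.1
bias = np.zeros(H)
W_out = rng.standard_normal((H, n_classes)) * 0.1

def rank_r_fnn(X):
    """Forward pass: <W_h, X> = sum_r X x_1 a_hr x_2 b_hr x_3 c_hr, then sigmoid and a linear head."""
    pre = np.einsum('ijk,hri,hrj,hrk->h', X, A, B, C) + bias   # multilinear contraction per hidden unit
    hidden = 1.0 / (1.0 + np.exp(-pre))                        # sigmoid activation
    return hidden @ W_out                                      # class scores

X = rng.standard_normal((I, J, K))
print(rank_r_fnn(X))   # one score per class; parameters scale with R*(I+J+K) instead of I*J*K
```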
arXiv Detail & Related papers (2021-04-11T16:37:32Z) - How Neural Networks Extrapolate: From Feedforward to Graph Neural
Networks [80.55378250013496]
We study how neural networks trained by gradient descent extrapolate what they learn outside the support of the training distribution.
Graph Neural Networks (GNNs) have shown some success in more complex tasks.
arXiv Detail & Related papers (2020-09-24T17:48:59Z) - Provably Efficient Neural Estimation of Structural Equation Model: An
Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using gradient descent.
For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
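Here is a minimal PyTorch sketch of the general adversarial conditional-moment idea on a toy instrumental-variable problem: a structural function f and an adversarial test function g, both small networks, trained by alternating gradient steps on a min-max objective. The data-generating process, architectures, and objective details are assumptions for illustration, not the paper's estimator.

```python
import torch

torch.manual_seed(0)

# Toy instrumental-variable data (an assumption for illustration): Z is an instrument,
# X is endogenous because of an unobserved confounder, and E[Y - f0(X) | Z] = 0
# holds for the true structural function f0(x) = x**2.
n = 2000
Z = torch.randn(n, 1)
confounder = torch.randn(n, 1)
X = Z + 0.5 * confounder + 0.1 * torch.randn(n, 1)
Y = X ** 2 + confounder + 0.1 * torch.randn(n, 1)

f = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))  # structural function
g = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))  # adversarial test function

opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)

for step in range(3000):
    # Ascent step on the adversary: g tries to detect any violation of the moment condition.
    gz = g(Z)
    game_g = ((Y - f(X)).detach() * gz).mean() - 0.5 * (gz ** 2).mean()
    opt_g.zero_grad(); (-game_g).backward(); opt_g.step()

    # Descent step on the structural function: f tries to make the residual uncorrelated with g(Z).
    gz = g(Z).detach()
    game_f = ((Y - f(X)) * gz).mean()
    opt_f.zero_grad(); game_f.backward(); opt_f.step()

with torch.no_grad():
    xs = torch.linspace(-2, 2, 5).reshape(-1, 1)
    print(torch.cat([xs, f(xs), xs ** 2], dim=1))   # columns: x, learned f(x), true f0(x)
```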
arXiv Detail & Related papers (2020-07-02T17:55:47Z)