Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
- URL: http://arxiv.org/abs/2411.12537v1
- Date: Tue, 19 Nov 2024 14:35:38 GMT
- Title: Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
- Authors: Riccardo Grazzi, Julien Siems, Jörg K. H. Franke, Arber Zela, Frank Hutter, Massimiliano Pontil
- Abstract summary: Linear Recurrent Neural Networks (LRNNs) have emerged as efficient alternatives to Transformers in large language modeling.
LRNNs struggle to perform state-tracking, which may impair performance in tasks such as code evaluation or tracking a chess game.
Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.
- Score: 65.41946981594567
- License:
- Abstract: Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers in large language modeling, offering linear scaling with sequence length and improved training efficiency. However, LRNNs struggle to perform state-tracking which may impair performance in tasks such as code evaluation or tracking a chess game. Even parity, the simplest state-tracking task, which non-linear RNNs like LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the value range of their diagonal state-transition matrices to $[0, 1]$ and that incorporating negative values can resolve this issue. We extend this result to non-diagonal LRNNs, which have recently shown promise in models such as DeltaNet. We prove that finite precision LRNNs with state-transition matrices having only positive eigenvalues cannot solve parity, while complex eigenvalues are needed to count modulo $3$. Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of identity minus vector outer product matrices, each with eigenvalues in the range $[-1, 1]$. Our empirical results confirm that extending the eigenvalue range of models like Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. Furthermore, pre-training LRNNs with an extended eigenvalue range for language modeling achieves comparable performance and stability while showing promise on code and math data. Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.
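To make the abstract's eigenvalue argument concrete, here is a minimal NumPy sketch (my own illustration, with hand-set rather than learned parameters) of a one-dimensional diagonal LRNN on the parity task: when the input-dependent transition value is confined to [0, 1] the state can only decay or persist, whereas allowing the value -1 makes the state flip sign on every one-bit, so its sign tracks parity exactly.

```python
import numpy as np

def diagonal_lrnn(bits, a_for_one, a_for_zero=1.0, h0=1.0):
    """Run a 1-D diagonal linear RNN h_t = a(x_t) * h_{t-1}.

    The input-dependent scalar a(x_t) stands in for the diagonal
    state-transition matrix; its allowed value range is the point here.
    """
    h = h0
    for b in bits:
        h = (a_for_one if b == 1 else a_for_zero) * h
    return h

bits = np.random.randint(0, 2, size=20)
parity = int(bits.sum() % 2)   # 0 if the number of ones is even, 1 if odd

# Transition value -1 on a one-bit: the state flips sign per one, so its sign encodes parity.
h_neg = diagonal_lrnn(bits, a_for_one=-1.0)
pred = 0 if h_neg > 0 else 1

# Transition values restricted to [0, 1]: the state can only shrink or stay put,
# so it decays monotonically and cannot oscillate with the running count of ones.
h_pos = diagonal_lrnn(bits, a_for_one=0.5)

print(f"parity={parity}  prediction with a=-1: {pred}  state with a in [0,1]: {h_pos:.6f}")
```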
Related papers
- Were RNNs All We Needed? [53.393497486332]
We revisit traditional recurrent neural networks (RNNs) from over a decade ago.
We show that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer require backpropagation through time (BPTT) and can be efficiently trained in parallel.
arXiv Detail & Related papers (2024-10-02T03:06:49Z)
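As a rough illustration of the idea in the entry above (a sketch under my own assumptions, not the paper's exact minimal-cell definitions): if the update gate and candidate state depend only on the current input, the recurrence h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t becomes a first-order linear recurrence in h, so all timesteps can in principle be evaluated with a parallel scan rather than step-by-step BPTT. The sequential reference below shows the structure; the weight names are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_like_step(x_t, h_prev, Wz, Wh):
    """One step of a GRU-like cell whose gate and candidate see only x_t, not h_{t-1}."""
    z = sigmoid(Wz @ x_t)        # update gate: input-only (hypothetical weights Wz)
    h_tilde = np.tanh(Wh @ x_t)  # candidate state: input-only (hypothetical weights Wh)
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 16
Wz = rng.normal(size=(d_h, d_in))
Wh = rng.normal(size=(d_h, d_in))
xs = rng.normal(size=(T, d_in))

# Sequential reference evaluation (what a parallel scan would reproduce):
h = np.zeros(d_h)
for x_t in xs:
    h = gru_like_step(x_t, h, Wz, Wh)

# Because z_t and h_tilde_t do not depend on h_{t-1}, each step has the form
# h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h_tilde_t,
# exactly the associative recurrence a parallel (prefix) scan can evaluate.
print(h[:4])
```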
- Learning nonlinear integral operators via Recurrent Neural Networks and its application in solving Integro-Differential Equations [4.011446845089061]
We learn and represent nonlinear integral operators that appear in nonlinear integro-differential equations (IDEs).
The LSTM-RNN representation of the nonlinear integral operator allows us to turn a system of nonlinear integro-differential equations into a system of ordinary differential equations.
We show how this methodology can effectively solve Dyson's equation for quantum many-body systems.
arXiv Detail & Related papers (2023-10-13T22:57:46Z)
- Advancing Regular Language Reasoning in Linear Recurrent Neural Networks [56.11830645258106]
We study whether linear recurrent neural networks (LRNNs) can learn the hidden rules in training sequences.
We propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix.
Experiments suggest that the proposed model is the only LRNN capable of performing length extrapolation on regular language tasks.
arXiv Detail & Related papers (2023-09-14T03:36:01Z)
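The entry above describes an LRNN whose state-transition matrix is block-diagonal and input-dependent. The following is a minimal sketch of that general recipe under my own assumed parameterization (2x2 rotation-like blocks with an input-predicted angle and gain), not the authors' actual model.

```python
import numpy as np

def block_diag_transition(x_t, Wtheta, Wg):
    """Build an input-dependent, block-diagonal transition matrix from 2x2 blocks.

    Assumed toy parameterization: block i is g_i * R(theta_i), where the rotation
    angle theta_i and the gain g_i in (0, 1) are both predicted from the current
    input x_t (Wtheta and Wg are hypothetical weights).
    """
    thetas = Wtheta @ x_t                       # one angle per block
    gains = 1.0 / (1.0 + np.exp(-(Wg @ x_t)))   # one gain per block, in (0, 1)
    n = 2 * len(thetas)
    A = np.zeros((n, n))
    for i, (theta, g) in enumerate(zip(thetas, gains)):
        c, s = np.cos(theta), np.sin(theta)
        A[2 * i:2 * i + 2, 2 * i:2 * i + 2] = g * np.array([[c, -s], [s, c]])
    return A

rng = np.random.default_rng(1)
d_in, n_blocks = 3, 4
Wtheta = rng.normal(size=(n_blocks, d_in))
Wg = rng.normal(size=(n_blocks, d_in))
Wx = rng.normal(size=(2 * n_blocks, d_in))

# LRNN recurrence with an input-dependent transition: h_t = A(x_t) h_{t-1} + Wx x_t
h = np.zeros(2 * n_blocks)
for x_t in rng.normal(size=(10, d_in)):
    h = block_diag_transition(x_t, Wtheta, Wg) @ h + Wx @ x_t

print(h)
```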
- Symmetric Tensor Networks for Generative Modeling and Constrained Combinatorial Optimization [72.41480594026815]
Constrained optimization problems abound in industry, from portfolio optimization to logistics.
One of the major roadblocks in solving these problems is the presence of non-trivial hard constraints which limit the valid search space.
In this work, we encode arbitrary integer-valued equality constraints of the form Ax=b directly into U(1) symmetric tensor networks (TNs) and leverage their applicability as quantum-inspired generative models.
arXiv Detail & Related papers (2022-11-16T18:59:54Z)
- Recurrent Bilinear Optimization for Binary Neural Networks [58.972212365275595]
Existing BNN optimization methods neglect the intrinsic bilinear relationship between real-valued weights and scale factors.
Our work is the first attempt to optimize BNNs from the bilinear perspective.
We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets.
arXiv Detail & Related papers (2022-09-04T06:45:33Z)
- Adaptive Discounting of Implicit Language Models in RNN-Transducers [33.63456351411599]
We show how a lightweight adaptive LM discounting technique can be used with any RNN-T architecture.
We obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
arXiv Detail & Related papers (2022-02-21T08:44:56Z)
- Matrix Smoothing: A Regularization for DNN with Transition Matrix under Noisy Labels [54.585681272543056]
Training deep neural networks (DNNs) in the presence of noisy labels is an important and challenging task.
Recent probabilistic methods directly apply the transition matrix to the DNN but neglect the DNN's susceptibility to overfitting.
We propose a novel method in which a smoothed transition matrix is used for updating the DNN, in order to restrict overfitting.
arXiv Detail & Related papers (2020-03-26T13:49:37Z)
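The entry above combines a label-noise transition matrix with smoothing. Below is a hedged sketch of the general mechanism, assuming a forward-correction loss and implementing "smoothing" as interpolation toward a uniform matrix; the paper's exact smoothing scheme and training loop may differ.

```python
import numpy as np

def smooth_transition(T, alpha):
    """Interpolate a label-noise transition matrix toward the uniform matrix.

    Assumed smoothing form (the paper's exact scheme may differ). Rows of T
    give p(noisy label | true label) and must each sum to 1.
    """
    C = T.shape[0]
    return (1.0 - alpha) * T + alpha * np.full((C, C), 1.0 / C)

def forward_corrected_nll(clean_probs, noisy_labels, T_smooth):
    """Cross-entropy of noisy labels under p(noisy) = p(clean) @ T_smooth."""
    noisy_probs = clean_probs @ T_smooth
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked + 1e-12))

C = 3
T = np.full((C, C), 0.1)        # 10% chance of flipping to each wrong class...
np.fill_diagonal(T, 0.8)        # ...and 80% chance the observed label is correct
T_smooth = smooth_transition(T, alpha=0.2)

clean_probs = np.array([[0.7, 0.2, 0.1],    # DNN softmax outputs over clean classes
                        [0.1, 0.8, 0.1]])
noisy_labels = np.array([0, 1])
print(forward_corrected_nll(clean_probs, noisy_labels, T_smooth))
```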
- The Power of Linear Recurrent Neural Networks [1.124958340749622]
We show how autoregressive linear (i.e., linearly activated) recurrent neural networks (LRNNs) can approximate any time-dependent function f(t).
LRNNs outperform the previous state-of-the-art for the MSO (multiple superimposed oscillator) task with a minimal number of units.
arXiv Detail & Related papers (2018-02-09T15:35:41Z)
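To give a flavor of the last entry's claim, here is a minimal construction of my own (not the paper's method): a linear recurrence whose transition matrix is built from 2x2 rotation blocks generates sums of sinusoids, and a least-squares linear readout over its state fits an MSO-style target of superimposed oscillators.

```python
import numpy as np

# Target: a small MSO-style signal, a sum of superimposed sinusoids.
freqs = np.array([0.2, 0.311, 0.42])   # angular frequencies (radians per step), chosen arbitrarily
t = np.arange(400)
target = np.sin(np.outer(t, freqs)).sum(axis=1)

# Autonomous linear RNN h_t = A h_{t-1}, with A block-diagonal over 2x2 rotations.
# A block rotating by angle w makes its two state entries trace cos(w t) and sin(w t).
n = 2 * len(freqs)
A = np.zeros((n, n))
for i, w in enumerate(freqs):
    c, s = np.cos(w), np.sin(w)
    A[2 * i:2 * i + 2, 2 * i:2 * i + 2] = np.array([[c, -s], [s, c]])

# Unroll the recurrence and collect the states.
H = np.zeros((len(t), n))
h = np.tile([1.0, 0.0], len(freqs))   # each block starts at (cos 0, sin 0)
for k in range(len(t)):
    H[k] = h
    h = A @ h

# Linear readout fitted by least squares: y_t ~= H[t] @ coef.
coef, *_ = np.linalg.lstsq(H, target, rcond=None)
print("max abs fit error:", np.max(np.abs(H @ coef - target)))
```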