Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
- URL: http://arxiv.org/abs/2411.12537v1
- Date: Tue, 19 Nov 2024 14:35:38 GMT
- Title: Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
- Authors: Riccardo Grazzi, Julien Siems, Jörg K. H. Franke, Arber Zela, Frank Hutter, Massimiliano Pontil
- Abstract summary: Linear Recurrent Neural Networks (LRNNs) have emerged as efficient alternatives to Transformers in large language modeling.
LRNNs struggle to perform state-tracking, which may impair performance in tasks such as code evaluation or tracking a chess game.
Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.
- Score: 65.41946981594567
- License:
- Abstract: Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers in large language modeling, offering linear scaling with sequence length and improved training efficiency. However, LRNNs struggle to perform state-tracking which may impair performance in tasks such as code evaluation or tracking a chess game. Even parity, the simplest state-tracking task, which non-linear RNNs like LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the value range of their diagonal state-transition matrices to $[0, 1]$ and that incorporating negative values can resolve this issue. We extend this result to non-diagonal LRNNs, which have recently shown promise in models such as DeltaNet. We prove that finite precision LRNNs with state-transition matrices having only positive eigenvalues cannot solve parity, while complex eigenvalues are needed to count modulo $3$. Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of identity minus vector outer product matrices, each with eigenvalues in the range $[-1, 1]$. Our empirical results confirm that extending the eigenvalue range of models like Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. Furthermore, pre-training LRNNs with an extended eigenvalue range for language modeling achieves comparable performance and stability while showing promise on code and math data. Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.
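To make the abstract's eigenvalue argument concrete, here is a minimal NumPy sketch (my own illustration, with hand-set rather than learned parameters) of a one-dimensional diagonal LRNN on the parity task: when the input-dependent transition value is confined to [0, 1] the state can only decay or persist, whereas allowing the value -1 makes the state flip sign on every one-bit, so its sign tracks parity exactly.

```python
import numpy as np

def diagonal_lrnn(bits, a_for_one, a_for_zero=1.0, h0=1.0):
    """Run a 1-D diagonal linear RNN h_t = a(x_t) * h_{t-1}.

    The input-dependent scalar a(x_t) stands in for the diagonal
    state-transition matrix; its allowed value range is the point here.
    """
    h = h0
    for b in bits:
        h = (a_for_one if b == 1 else a_for_zero) * h
    return h

bits = np.random.randint(0, 2, size=20)
parity = int(bits.sum() % 2)   # 0 if the number of ones is even, 1 if odd

# Transition value -1 on a one-bit: the state flips sign per one, so its sign encodes parity.
h_neg = diagonal_lrnn(bits, a_for_one=-1.0)
pred = 0 if h_neg > 0 else 1

# Transition values restricted to [0, 1]: the state can only shrink or stay put,
# so it decays monotonically and cannot oscillate with the running count of ones.
h_pos = diagonal_lrnn(bits, a_for_one=0.5)

print(f"parity={parity}  prediction with a=-1: {pred}  state with a in [0,1]: {h_pos:.6f}")
```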
Related papers
- Were RNNs All We Needed? [53.393497486332]
We revisit traditional recurrent neural networks (RNNs) from over a decade ago.
We show that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer require backpropagation through time (BPTT) and can be efficiently trained in parallel.
arXiv Detail & Related papers (2024-10-02T03:06:49Z)
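As a rough illustration of the idea in the entry above (a sketch under my own assumptions, not the paper's exact minimal-cell definitions): if the update gate and candidate state depend only on the current input, the recurrence h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t becomes a first-order linear recurrence in h, so all timesteps can in principle be evaluated with a parallel scan rather than step-by-step BPTT. The sequential reference below shows the structure; the weight names are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_like_step(x_t, h_prev, Wz, Wh):
    """One step of a GRU-like cell whose gate and candidate see only x_t, not h_{t-1}."""
    z = sigmoid(Wz @ x_t)        # update gate: input-only (hypothetical weights Wz)
    h_tilde = np.tanh(Wh @ x_t)  # candidate state: input-only (hypothetical weights Wh)
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 16
Wz = rng.normal(size=(d_h, d_in))
Wh = rng.normal(size=(d_h, d_in))
xs = rng.normal(size=(T, d_in))

# Sequential reference evaluation (what a parallel scan would reproduce):
h = np.zeros(d_h)
for x_t in xs:
    h = gru_like_step(x_t, h, Wz, Wh)

# Because z_t and h_tilde_t do not depend on h_{t-1}, each step has the form
# h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h_tilde_t,
# exactly the associative recurrence a parallel (prefix) scan can evaluate.
print(h[:4])
```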
- Learning nonlinear integral operators via Recurrent Neural Networks and its application in solving Integro-Differential Equations [4.011446845089061]
We learn and represent nonlinear integral operators that appear in nonlinear integro-differential equations (IDEs).
The LSTM-RNN representation of the nonlinear integral operator allows us to turn a system of nonlinear integro-differential equations into a system of ordinary differential equations.
We show how this methodology can effectively solve Dyson's equation for quantum many-body systems.
arXiv Detail & Related papers (2023-10-13T22:57:46Z)
- Advancing Regular Language Reasoning in Linear Recurrent Neural Networks [56.11830645258106]
We study whether linear recurrent neural networks (LRNNs) can learn the hidden rules in training sequences.
We propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix.
Experiments suggest that the proposed model is the only LRNN capable of performing length extrapolation on regular language tasks.
arXiv Detail & Related papers (2023-09-14T03:36:01Z)
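The entry above describes an LRNN whose state-transition matrix is block-diagonal and input-dependent. The following is a minimal sketch of that general recipe under my own assumed parameterization (2x2 rotation-like blocks with an input-predicted angle and gain), not the authors' actual model.

```python
import numpy as np

def block_diag_transition(x_t, Wtheta, Wg):
    """Build an input-dependent, block-diagonal transition matrix from 2x2 blocks.

    Assumed toy parameterization: block i is g_i * R(theta_i), where the rotation
    angle theta_i and the gain g_i in (0, 1) are both predicted from the current
    input x_t (Wtheta and Wg are hypothetical weights).
    """
    thetas = Wtheta @ x_t                       # one angle per block
    gains = 1.0 / (1.0 + np.exp(-(Wg @ x_t)))   # one gain per block, in (0, 1)
    n = 2 * len(thetas)
    A = np.zeros((n, n))
    for i, (theta, g) in enumerate(zip(thetas, gains)):
        c, s = np.cos(theta), np.sin(theta)
        A[2 * i:2 * i + 2, 2 * i:2 * i + 2] = g * np.array([[c, -s], [s, c]])
    return A

rng = np.random.default_rng(1)
d_in, n_blocks = 3, 4
Wtheta = rng.normal(size=(n_blocks, d_in))
Wg = rng.normal(size=(n_blocks, d_in))
Wx = rng.normal(size=(2 * n_blocks, d_in))

# LRNN recurrence with an input-dependent transition: h_t = A(x_t) h_{t-1} + Wx x_t
h = np.zeros(2 * n_blocks)
for x_t in rng.normal(size=(10, d_in)):
    h = block_diag_transition(x_t, Wtheta, Wg) @ h + Wx @ x_t

print(h)
```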
- Symmetric Tensor Networks for Generative Modeling and Constrained Combinatorial Optimization [72.41480594026815]
Constrained optimization problems abound in industry, from portfolio optimization to logistics.
One of the major roadblocks in solving these problems is the presence of non-trivial hard constraints which limit the valid search space.
In this work, we encode arbitrary integer-valued equality constraints of the form Ax=b directly into U(1) symmetric tensor networks (TNs) and leverage their applicability as quantum-inspired generative models.
arXiv Detail & Related papers (2022-11-16T18:59:54Z)
- Recurrent Bilinear Optimization for Binary Neural Networks [58.972212365275595]
Existing BNN optimization methods neglect the intrinsic bilinear relationship between real-valued weights and scale factors.
Our work is the first attempt to optimize BNNs from the bilinear perspective.
We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets.
arXiv Detail & Related papers (2022-09-04T06:45:33Z)
- Adaptive Discounting of Implicit Language Models in RNN-Transducers [33.63456351411599]
We show how a lightweight adaptive LM discounting technique can be used with any RNN-T architecture.
We obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
arXiv Detail & Related papers (2022-02-21T08:44:56Z)
- Matrix Smoothing: A Regularization for DNN with Transition Matrix under Noisy Labels [54.585681272543056]
Training deep neural networks (DNNs) in the presence of noisy labels is an important and challenging task.
Recent probabilistic methods directly apply the transition matrix to the DNN but neglect the DNN's susceptibility to overfitting.
We propose a novel method in which a smoothed transition matrix is used for updating the DNN, in order to restrict overfitting.
arXiv Detail & Related papers (2020-03-26T13:49:37Z)
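The entry above combines a label-noise transition matrix with smoothing. Below is a hedged sketch of the general mechanism, assuming a forward-correction loss and implementing "smoothing" as interpolation toward a uniform matrix; the paper's exact smoothing scheme and training loop may differ.

```python
import numpy as np

def smooth_transition(T, alpha):
    """Interpolate a label-noise transition matrix toward the uniform matrix.

    Assumed smoothing form (the paper's exact scheme may differ). Rows of T
    give p(noisy label | true label) and must each sum to 1.
    """
    C = T.shape[0]
    return (1.0 - alpha) * T + alpha * np.full((C, C), 1.0 / C)

def forward_corrected_nll(clean_probs, noisy_labels, T_smooth):
    """Cross-entropy of noisy labels under p(noisy) = p(clean) @ T_smooth."""
    noisy_probs = clean_probs @ T_smooth
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked + 1e-12))

C = 3
T = np.full((C, C), 0.1)        # 10% chance of flipping to each wrong class...
np.fill_diagonal(T, 0.8)        # ...and 80% chance the observed label is correct
T_smooth = smooth_transition(T, alpha=0.2)

clean_probs = np.array([[0.7, 0.2, 0.1],    # DNN softmax outputs over clean classes
                        [0.1, 0.8, 0.1]])
noisy_labels = np.array([0, 1])
print(forward_corrected_nll(clean_probs, noisy_labels, T_smooth))
```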
- The Power of Linear Recurrent Neural Networks [1.124958340749622]
We show how autoregressive linear (i.e., linearly activated) recurrent neural networks (LRNNs) can approximate any time-dependent function f(t).
LRNNs outperform the previous state-of-the-art for the MSO (multiple superimposed oscillator) task with a minimal number of units.
arXiv Detail & Related papers (2018-02-09T15:35:41Z)
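To give a flavor of the last entry's claim, here is a minimal construction of my own (not the paper's method): a linear recurrence whose transition matrix is built from 2x2 rotation blocks generates sums of sinusoids, and a least-squares linear readout over its state fits an MSO-style target of superimposed oscillators.

```python
import numpy as np

# Target: a small MSO-style signal, a sum of superimposed sinusoids.
freqs = np.array([0.2, 0.311, 0.42])   # angular frequencies (radians per step), chosen arbitrarily
t = np.arange(400)
target = np.sin(np.outer(t, freqs)).sum(axis=1)

# Autonomous linear RNN h_t = A h_{t-1}, with A block-diagonal over 2x2 rotations.
# A block rotating by angle w makes its two state entries trace cos(w t) and sin(w t).
n = 2 * len(freqs)
A = np.zeros((n, n))
for i, w in enumerate(freqs):
    c, s = np.cos(w), np.sin(w)
    A[2 * i:2 * i + 2, 2 * i:2 * i + 2] = np.array([[c, -s], [s, c]])

# Unroll the recurrence and collect the states.
H = np.zeros((len(t), n))
h = np.tile([1.0, 0.0], len(freqs))   # each block starts at (cos 0, sin 0)
for k in range(len(t)):
    H[k] = h
    h = A @ h

# Linear readout fitted by least squares: y_t ~= H[t] @ coef.
coef, *_ = np.linalg.lstsq(H, target, rcond=None)
print("max abs fit error:", np.max(np.abs(H @ coef - target)))
```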