Why are state-space models more expressive than $n$-gram models?
- URL: http://arxiv.org/abs/2306.17184v2
- Date: Sun, 15 Dec 2024 00:24:59 GMT
- Title: Why are state-space models more expressive than $n$-gram models?
- Authors: Vinoth Nandakumar, Qiang Qu, Peng Mi, Tongliang Liu
- Abstract summary: We construct state space language models that can solve the next-word prediction task for languages generated from $n$-gram rules.
Our proof shows how SSMs can encode $n$-gram rules using new theoretical results on their memorization capacity.
We conduct experiments with a small dataset generated from $n$-gram rules to show how our framework can be applied.
- Score: 51.823427608117626
- Abstract: Recent advancements in recurrent neural networks (RNNs) have reinvigorated interest in their application to natural language processing tasks, particularly with the development of more efficient and parallelizable variants known as state space models (SSMs), which have shown competitive performance against transformer models while maintaining a lower memory footprint. While RNNs and SSMs (e.g., Mamba) have been empirically more successful than rule-based systems based on $n$-gram models, a rigorous theoretical explanation for this success has not yet been developed, as it is unclear how these models encode the combinatorial rules that govern the next-word prediction task. In this paper, we construct state space language models that can solve the next-word prediction task for languages generated from $n$-gram rules, thereby showing that the former are more expressive. Our proof shows how SSMs can encode $n$-gram rules using new theoretical results on their memorization capacity, and demonstrates how their context window can be controlled by restricting the spectrum of the hidden weight matrix. We conduct experiments with a small dataset generated from $n$-gram rules to show how our framework can be applied to SSMs and RNNs obtained through gradient-based optimization.
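As a concrete (and purely illustrative) reading of the last point, the sketch below runs a plain linear state-space recurrence in NumPy and rescales the hidden weight matrix so its spectral radius stays below 1; the influence of a token from $k$ steps back then decays roughly like $\rho^k$, which is one simple way to see how the spectrum bounds the effective context window. All dimensions and matrices here (`A`, `B`, `C`, `rho`) are assumptions, not the paper's construction.

```python
import numpy as np

# Toy linear state-space model: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# Everything here (dimensions, A, B, C, rho) is an illustrative assumption,
# not the construction from the paper.
rng = np.random.default_rng(0)
d_state, d_in = 8, 4

A = rng.normal(size=(d_state, d_state))
rho = 0.5  # target spectral radius; smaller rho -> shorter effective context
A *= rho / np.max(np.abs(np.linalg.eigvals(A)))  # rescale so spectral radius == rho

B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(1, d_state))

def run_ssm(inputs):
    """Scan the linear recurrence over a list of input vectors."""
    h = np.zeros(d_state)
    outputs = []
    for x in inputs:
        h = A @ h + B @ x
        outputs.append(C @ h)
    return np.array(outputs)

# Feed a single impulse followed by zeros: the state at step k is A^k B x_0,
# so its norm shrinks roughly like rho**k and old tokens are forgotten.
impulse = [rng.normal(size=d_in)] + [np.zeros(d_in)] * 10
print(np.linalg.norm(run_ssm(impulse), axis=1))
```

The paper's actual result concerns how this spectral restriction controls the context window within its memorization construction; the toy above only illustrates the decay mechanism, not the memorization argument.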
Related papers
- Implicit Language Models are RNNs: Balancing Parallelization and Expressivity [4.332158627306896]
State-space models (SSMs) and transformers dominate the language modeling landscape.
We propose implicit SSMs, which iterate a transformation until convergence to a fixed point.
Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs.
arXiv Detail & Related papers (2025-02-10T19:59:31Z)
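As a rough, assumption-laden sketch of the fixed-point idea in the entry above (not the authors' architecture): a contractive state update can be iterated until the state stops changing.

```python
import numpy as np

# Hypothetical implicit state update: iterate h <- f(h, x) to a fixed point.
# W, U, and all sizes are made up for illustration; f is kept a contraction
# (small weights + tanh) so the iteration provably converges.
rng = np.random.default_rng(1)
d_state, d_in = 8, 4
W = 0.1 * rng.normal(size=(d_state, d_state))
U = rng.normal(size=(d_state, d_in))

def f(h, x):
    return np.tanh(W @ h + U @ x)

def implicit_step(x, tol=1e-8, max_iters=200):
    """Return an approximate fixed point h* with h* ~= f(h*, x)."""
    h = np.zeros(d_state)
    for _ in range(max_iters):
        h_next = f(h, x)
        if np.linalg.norm(h_next - h) < tol:
            return h_next
        h = h_next
    return h

x = rng.normal(size=d_in)
h_star = implicit_step(x)
print(np.allclose(h_star, f(h_star, x), atol=1e-6))  # True: an approximate fixed point
```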
- Great Memory, Shallow Reasoning: Limits of $k$NN-LMs [71.73611113995143]
$k$NN-LMs, which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling.
We ask whether this improved ability to recall information really translates into downstream abilities.
arXiv Detail & Related papers (2024-08-21T17:59:05Z)
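For background on the entry above, the usual $k$NN-LM recipe interpolates the base model's next-word distribution with one induced by nearest-neighbour retrieval over a datastore of cached contexts. The sketch below is schematic; the datastore, the distance function, and the interpolation weight `lam` are all assumptions.

```python
import numpy as np

# Schematic kNN-LM interpolation: mix the base LM's next-word distribution with
# a distribution built from the k nearest cached contexts.
def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=4, lam=0.25, temp=1.0):
    # Distances from the query context representation to every cached key.
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances gives a weight per retrieved neighbour.
    weights = np.exp(-dists[nearest] / temp)
    weights /= weights.sum()
    # Scatter the weights onto the tokens that followed those cached contexts.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)
    # Final distribution: lam * retrieval + (1 - lam) * base LM.
    return lam * p_knn + (1.0 - lam) * p_lm

rng = np.random.default_rng(2)
V, d = 50, 16
p_lm = rng.dirichlet(np.ones(V))       # base LM next-word distribution
keys = rng.normal(size=(200, d))       # cached context representations
values = rng.integers(0, V, size=200)  # token that followed each cached context
print(knn_lm_probs(p_lm, rng.normal(size=d), keys, values, V).sum())  # ~1.0
```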
- The Role of $n$-gram Smoothing in the Age of Neural Networks [60.23726773548038]
This paper re-opens the question of what role classical $n$-gram smoothing techniques may play in the age of neural language models.
We derive a framework for converting any $n$-gram smoothing technique into a regularizer compatible with neural language models.
arXiv Detail & Related papers (2024-03-25T22:42:19Z)
- Theoretical Foundations of Deep Selective State-Space Models [13.971499161967083]
Deep SSMs demonstrate outstanding performance across a diverse set of domains.
Recent developments show that, if the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states, the resulting architecture can surpass attention-powered foundation models in both accuracy and efficiency.
We show that when random linear recurrences are equipped with simple input-controlled transitions, the hidden state is provably a low-dimensional projection of a powerful mathematical object.
arXiv Detail & Related papers (2024-02-29T11:20:16Z)
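A minimal reading of the "input-controlled transitions" mentioned above, in the spirit of selective SSMs such as Mamba rather than the paper's exact parameterisation (the shapes and the sigmoid gate are assumptions): the transition applied to the hidden state is computed from the current input.

```python
import numpy as np

# Diagonal linear recurrence with an input-controlled transition:
#   h_t = a(x_t) * h_{t-1} + B x_t
# where a(x_t) in (0, 1)^d is a per-channel gate computed from the input.
rng = np.random.default_rng(3)
d_state, d_in = 8, 4
W_gate = rng.normal(size=(d_state, d_in))
B = rng.normal(size=(d_state, d_in))

def gate(x):
    return 1.0 / (1.0 + np.exp(-(W_gate @ x)))  # elementwise sigmoid

def selective_scan(inputs):
    h = np.zeros(d_state)
    states = []
    for x in inputs:
        h = gate(x) * h + B @ x  # the transition now depends on the input
        states.append(h)
    return np.array(states)

xs = rng.normal(size=(6, d_in))
print(selective_scan(xs).shape)  # (6, 8)
```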
- On the Computational Complexity and Formal Hierarchy of Second Order Recurrent Neural Networks [59.85314067235965]
We extend the theoretical foundation for second-order recurrent networks (2nd-order RNNs).
We prove that there exists a class of 2nd-order RNNs that is Turing-complete with bounded time.
We also demonstrate that 2nd-order RNNs, without external memory, outperform modern-day models such as vanilla RNNs and gated recurrent units in recognizing regular grammars.
arXiv Detail & Related papers (2023-09-26T06:06:47Z)
- Toward a Theory of Causation for Interpreting Neural Code Models [49.906221295459275]
This paper introduces $do_{code}$, a post hoc interpretability method specific to Neural Code Models (NCMs).
$do_{code}$ is based upon causal inference to enable language-oriented explanations.
Results show that our studied NCMs are sensitive to changes in code syntax.
arXiv Detail & Related papers (2023-02-07T22:56:58Z)
- Residual Learning of Neural Text Generation with $n$-gram Language Model [41.26228768053928]
We learn a neural LM that fits the residual between an $n$-gram LM and the real-data distribution.
Our approach consistently attains additional performance gains over popular standalone neural models.
arXiv Detail & Related papers (2022-10-26T02:42:53Z)
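One plausible shape for "fitting the residual" in the entry above (a guess at the general idea, not the paper's exact formulation): add the $n$-gram log-probabilities to the neural model's logits, so the network only needs to learn what the $n$-gram LM gets wrong.

```python
import numpy as np

# Residual combination of an n-gram LM and a neural LM over a shared vocabulary.
# The neural "logits" and the n-gram distribution below are random placeholders;
# the additive-in-log-space combination is the illustrative assumption.
def residual_lm_probs(neural_logits, ngram_probs, eps=1e-8):
    combined = neural_logits + np.log(ngram_probs + eps)  # residual in log space
    combined -= combined.max()                            # numerical stability
    p = np.exp(combined)
    return p / p.sum()

rng = np.random.default_rng(4)
V = 50
ngram_probs = rng.dirichlet(np.ones(V))   # stand-in for p_ngram(w | history)
neural_logits = 0.1 * rng.normal(size=V)  # small residual correction
p = residual_lm_probs(neural_logits, ngram_probs)
print(p.sum(), np.abs(p - ngram_probs).max())  # ~1.0, and close to the n-gram LM
```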
- Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack [73.48927855855219]
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state updating mechanisms.
arXiv Detail & Related papers (2020-04-04T14:19:15Z)
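To make "an external, differentiable memory structure, such as a stack" concrete, below is a toy soft stack in which discrete push/pop operations are replaced by a continuous blend; this is a generic textbook-style construction and an assumption, not the specific mechanism of the paper above.

```python
import numpy as np

# A soft (differentiable) stack: instead of discrete push/pop, each step blends
# the pushed, popped, and unchanged stacks with continuous action weights.
# Fixed depth, scalar cells, and this particular blend are assumptions.
class SoftStack:
    def __init__(self, depth=8):
        self.mem = np.zeros(depth)

    def step(self, value, a_push, a_pop, a_noop):
        pushed = np.concatenate(([value], self.mem[:-1]))  # shift down, write on top
        popped = np.concatenate((self.mem[1:], [0.0]))     # shift up, drop the top
        self.mem = a_push * pushed + a_pop * popped + a_noop * self.mem
        return self.mem[0]                                 # soft "top of stack"

stack = SoftStack()
# An RNN controller would emit (value, a_push, a_pop, a_noop) each step, with the
# action weights summing to 1 (e.g. via a softmax); here they are hard-coded.
print(stack.step(1.0, 0.9, 0.05, 0.05))  # mostly a push: the top becomes 0.9
print(stack.step(0.0, 0.1, 0.8, 0.1))    # mostly a pop: the top value shrinks
```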