Related papers: Demystifying the Token Dynamics of Deep Selective State Space Models

Demystifying the Token Dynamics of Deep Selective State Space Models

URL: http://arxiv.org/abs/2410.03292v2
Date: Fri, 07 Mar 2025 13:54:48 GMT
Title: Demystifying the Token Dynamics of Deep Selective State Space Models
Authors: Thieu N Vo, Tung D. Pham, Xin T. Tong, Tan Minh Nguyen,
Abstract summary: Selective state space models (SSM) have gained prominence for their effectiveness in modeling sequential data.<n>Despite their outstanding empirical performance, a comprehensive theoretical understanding of deep selective SSM remains elusive.<n>In this paper, we investigate the dynamical properties of tokens in a pre-trained Mamba model.
Score: 3.829322478948515
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Selective state space models (SSM), such as Mamba, have gained prominence for their effectiveness in modeling sequential data. Despite their outstanding empirical performance, a comprehensive theoretical understanding of deep selective SSM remains elusive, hindering their further development and adoption for applications that need high fidelity. In this paper, we investigate the dynamical properties of tokens in a pre-trained Mamba model. In particular, we derive the dynamical system governing the continuous-time limit of the Mamba model and characterize the asymptotic behavior of its solutions. In the one-dimensional case, we prove that only one of the following two scenarios happens: either all tokens converge to zero, or all tokens diverge to infinity. We provide criteria based on model parameters to determine when each scenario occurs. For the convergent scenario, we empirically verify that this scenario negatively impacts the model's performance. For the divergent scenario, we prove that different tokens will diverge to infinity at different rates, thereby contributing unequally to the updates during model training. Based on these investigations, we propose two refinements for the model: excluding the convergent scenario and reordering tokens based on their importance scores, both aimed at improving practical performance. Our experimental results validate these refinements, offering insights into enhancing Mamba's effectiveness in real-world applications.

Related papers

A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models [36.99162631444728]
We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block.<n>Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise.<n>We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds.
arXiv Detail & Related papers (2026-02-13T00:44:26Z)
D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models [91.21455683212224]
In large language models (LLMs), the probability of relevance for the next piece of information is linked to the probability of relevance for the next product.<n>But whether fine-grained sampling probabilities faithfully align with task requirements remains an open question.<n>We identify two model types: D-models, whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models, whose P_token is more stable and better aligned with P_task.
arXiv Detail & Related papers (2026-01-25T14:59:09Z)
Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization [0.0]
We study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization.<n>We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity.<n>Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization.<n>We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance.
arXiv Detail & Related papers (2026-01-21T14:22:25Z)
Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding [5.2482659629416535]
We characterize when tokens move closer to or farther from one another over time, depending on the model parameters.<n>We investigate how different forms of positional encoding -- specifically absolute and rotary -- affect these dynamical regimes.<n>Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding.
arXiv Detail & Related papers (2025-11-25T19:39:57Z)
Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems [31.897288482449003]
AI systems are typically developed through multiple stages-pretraining, fine-tuning rounds, and subsequent adaptation or alignment.<n>This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent?<n>We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process.
arXiv Detail & Related papers (2025-05-30T19:27:39Z)
LARES: Latent Reasoning for Sequential Recommendation [96.26996622771593]
We present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation.<n>Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity.<n>Our framework exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.
arXiv Detail & Related papers (2025-05-22T16:22:54Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Multi-Level Collaboration in Model Merging [56.31088116526825]
This paper explores the intrinsic connections between model merging and model ensembling. We find that even when previous restrictions are not met, there is still a way for model merging to attain a near-identical and superior performance similar to that of ensembling.
arXiv Detail & Related papers (2025-03-03T07:45:04Z)
[MASK] is All You Need [28.90875822599164]
We propose using discrete-state models to connect Masked Generative and Non-autoregressive Diffusion models. By leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models.
arXiv Detail & Related papers (2024-12-09T18:59:56Z)
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LM) This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts. We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
When predict can also explain: few-shot prediction to select better neural latents [3.6218162133579703]
We present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth. In the absence of ground truth, we suggest a proxy measure to quantify extraneous dynamics.
arXiv Detail & Related papers (2024-05-23T10:48:30Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
A prediction and behavioural analysis of machine learning methods for modelling travel mode choice [0.26249027950824505]
We conduct a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice. Results indicate that the models with the highest disaggregate predictive performance provide poorer estimates of behavioural indicators and aggregate mode shares. It is also observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.
arXiv Detail & Related papers (2023-01-11T11:10:32Z)
Discrete Auto-regressive Variational Attention Models for Text Modeling [53.38382932162732]
Variational autoencoders (VAEs) have been widely applied for text modeling. They are troubled by two challenges: information underrepresentation and posterior collapse. We propose Discrete Auto-regressive Variational Attention Model (DAVAM) to address the challenges.
arXiv Detail & Related papers (2021-06-16T06:36:26Z)
A Twin Neural Model for Uplift [59.38563723706796]
Uplift is a particular case of conditional treatment effect modeling. We propose a new loss function defined by leveraging a connection with the Bayesian interpretation of the relative risk. We show our proposed method is competitive with the state-of-the-art in simulation setting and on real data from large scale randomized experiments.
arXiv Detail & Related papers (2021-05-11T16:02:39Z)
Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization [60.73540999409032]
We show that expressive autoregressive dynamics models generate different dimensions of the next state and reward sequentially conditioned on previous dimensions. We also show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer.
arXiv Detail & Related papers (2021-04-28T16:48:44Z)
Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance. We provide tractable algorithms to compute the range of attainable group-level predictive disparities. We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms. We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance. We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
Autoregressive Asymmetric Linear Gaussian Hidden Markov Models [1.332091725929965]
Asymmetric hidden Markov models provide a framework where the trend of the process can be expressed as a latent variable. We show how inference, hidden states decoding and parameter learning must be adapted to fit the proposed model.
arXiv Detail & Related papers (2020-10-27T08:58:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.