Related papers: xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity

xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity

URL: http://arxiv.org/abs/2510.02228v1
Date: Thu, 02 Oct 2025 17:14:34 GMT
Title: xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity
Authors: Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter,
Abstract summary: Scaling laws play a central role in the success of Large Language Models.<n>Recent alternatives such as xLSTM offer linear complexity with respect to context length.<n>xLSTM's advantage widens as training and inference contexts grow.
Score: 22.40851170527
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Importantly, xLSTM's advantage widens as training and inference contexts grow.

Related papers

Deep Learning for Financial Time Series: A Large-Scale Benchmark of Risk-Adjusted Performance [4.889402269887708]
We present a large scale benchmark of modern deep learning architectures for a financial time series prediction and position sizing task.<n>We evaluate linear models, recurrent networks, transformer based architectures, state space models, and recent sequence representation approaches.<n>We find that models explicitly designed to learn rich temporal representations consistently outperform linear benchmarks and generic deep learning models.
arXiv Detail & Related papers (2026-03-02T12:52:50Z)
How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer.<n>Within the Fitting Paradigm, we introduce a Scaling Law for search factor, effectively reducing the search complexity from O(n3) to O(n*C_D*C_) via predictive modeling.<n>We extend the principles of $$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z)
AF-MAT: Aspect-aware Flip-and-Fuse xLSTM for Aspect-based Sentiment Analysis [0.6498237940960344]
We introduce Aspect-aware Flip-and-Fuse xLSTM (AF-MAT), a framework that leverages xLSTM's strengths.<n> AF-MAT features an Aspect-aware matrix LSTM mechanism that introduces a dedicated aspect gate, enabling the model to selectively emphasize tokens semantically relevant to the target aspect during memory updates.<n>Experiments on three benchmark datasets that AF-MAT outperforms state-of-the-art baselines, achieving higher accuracy in ABSA tasks.
arXiv Detail & Related papers (2025-07-01T22:21:33Z)
Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling.<n>Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations.<n>We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo [22.7130140114906]
We study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget.<n>We find that DiLoCo scales both predictably and robustly with model size.<n>When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes.
arXiv Detail & Related papers (2025-03-12T20:04:38Z)
Cost-Optimal Grouped-Query Attention for Long-Context Modeling [45.981681856747365]
Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models.<n>We analyze the relationship among context length, model size, GQA configuration, and model loss.<n>We propose a recipe for deriving cost-optimal GQA configurations.
arXiv Detail & Related papers (2025-03-12T17:50:42Z)
Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [59.369484219304866]
We conduct an unprecedented empirical investigation training over 3,700 Large Language Models (LLMs) from scratch across 100 trillion tokens.<n>We establish a universal Scaling Law for hyperparameter optimization in LLM Pre-training, called Step Law.<n>Our estimated optima deviates from the global best performance found via exhaustive search by merely 0.094% on the test set.
arXiv Detail & Related papers (2025-03-06T18:58:29Z)
Latent Thought Models with Variational Bayes Inference-Time Computation [52.63299874322121]
Latent Thought Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space.<n>LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality.<n>We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
arXiv Detail & Related papers (2024-11-30T10:56:30Z)
Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z)
Bi-Mamba+: Bidirectional Mamba for Time Series Forecasting [5.166854384000439]
Long-term time series forecasting (LTSF) provides longer insights into future trends and patterns. Recently, a new state space model (SSM) named Mamba is proposed. With the selective capability on input data and the hardware-aware parallel computing algorithm, Mamba has shown great potential in balancing predicting performance and computational efficiency.
arXiv Detail & Related papers (2024-04-24T09:45:48Z)
The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.