Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures
- URL: http://arxiv.org/abs/2503.18565v1
- Date: Mon, 24 Mar 2025 11:18:25 GMT
- Title: Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures
- Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer
- Abstract summary: We propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM). Our Distil-xLSTM focuses on approximating a transformer-based model's attention parametrization using its recurrent sequence-mixing components and shows good results with minimal training.
- Score: 6.553328746906528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current era of Natural Language Processing (NLP) is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although computation is done differently than with the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM) that shows promising results while being compute- and scale-efficient. Our Distil-xLSTM focuses on approximating a transformer-based model's attention parametrization using its recurrent sequence-mixing components and shows good results with minimal training.
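As a rough illustration of the distillation objective described in the abstract (a minimal sketch under stated assumptions, not the authors' released Distil-xLSTM code), the following PyTorch snippet distills a frozen transformer teacher's token distribution into a smaller recurrent student using a temperature-scaled KL term combined with the standard cross-entropy loss; the names `teacher`, `student`, `T`, and `alpha` are illustrative assumptions.

```python
# Sketch of logit-level knowledge distillation (illustrative only, not the
# paper's actual training code). `teacher` is assumed to be a frozen
# transformer LLM and `student` a smaller recurrent (e.g. xLSTM-style) LM,
# both mapping token ids to logits of shape (batch, seq_len, vocab).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary next-token cross-entropy on the ground truth.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd + (1.0 - alpha) * ce

@torch.no_grad()
def teacher_logits_fn(teacher, input_ids):
    teacher.eval()  # teacher stays frozen during distillation
    return teacher(input_ids).logits  # assumes a HuggingFace-style output object
```

Temperature scaling softens the teacher distribution so the student also learns from the relative probabilities of non-target tokens rather than only from the argmax.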
Related papers
- Efficient Model Selection for Time Series Forecasting via LLMs [52.31535714387368]
We propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection.
Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs.
arXiv Detail & Related papers (2025-04-02T20:33:27Z)
- Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
- Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach [31.654345704242512]
This paper introduces a novel, model-level judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We achieve superior precision and recall with significantly lower computational demands.
arXiv Detail & Related papers (2024-11-26T00:44:37Z)
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Pseudo-Label Training and Model Inertia in Neural Machine Translation [18.006833174265612]
Neural machine translation (NMT) models are sensitive to small input changes and can show significant variation across re-training or incremental model updates.
This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation and self-training.
While the effect on quality is well-documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations.
arXiv Detail & Related papers (2023-05-19T16:45:19Z)
- How robust are pre-trained models to distribution shift? [82.08946007821184]
We show how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder (AE) based models.
We develop a novel evaluation scheme in which the linear head is trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation (a minimal linear-probe sketch of this idea is given after this list).
arXiv Detail & Related papers (2022-06-17T16:18:28Z)
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)
- Bidirectional LSTM-CRF Attention-based Model for Chinese Word Segmentation [2.3991565023534087]
We propose a Bidirectional LSTM-CRF Attention-based Model for Chinese word segmentation.
Our model performs better than baseline methods built on other neural networks.
arXiv Detail & Related papers (2021-05-20T11:46:53Z)
- Compressing LSTM Networks by Matrix Product Operators [7.395226141345625]
Long Short-Term Memory (LSTM) models are the building blocks of many state-of-the-art natural language processing (NLP) and speech enhancement (SE) algorithms.
Here we introduce the MPO decomposition, which describes the local correlation of quantum states in quantum many-body physics.
We propose a matrix product operator (MPO)-based neural network architecture to replace the LSTM model.
arXiv Detail & Related papers (2020-12-22T11:50:06Z)
- Scaling Hidden Markov Language Models [118.55908381553056]
This work revisits the challenge of scaling HMMs to language modeling datasets.
We propose methods for scaling HMMs to massive state spaces while maintaining efficient exact inference, a compact parameterization, and effective regularization.
arXiv Detail & Related papers (2020-11-09T18:51:55Z)
- Difference Attention Based Error Correction LSTM Model for Time Series Prediction [3.7990471017645855]
We propose a novel model for time series prediction in which a difference-attention LSTM model and an error-correction LSTM model are employed and combined in a cascaded manner.
With additional difference features and a new principled learning framework, our model can improve prediction accuracy on time series.
arXiv Detail & Related papers (2020-03-30T16:48:30Z)
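The following snippet relates to "How robust are pre-trained models to distribution shift?" above. It is a minimal sketch (assuming hypothetical objects `encoder`, `ood_train_loader`, and `ood_test_loader`; not the paper's released code) of the linear-head evaluation idea: a frozen pre-trained encoder is probed with a linear classifier that is itself trained on out-of-distribution data, so the head cannot compensate for weaknesses of the representation.

```python
# Linear-probe sketch: freeze the pre-trained encoder, train only a linear
# head on OOD data, then report accuracy on held-out OOD data.
import torch
import torch.nn as nn

def linear_probe(encoder, ood_train_loader, ood_test_loader,
                 feat_dim, num_classes, epochs=5, lr=1e-3, device="cpu"):
    encoder.eval().to(device)                      # frozen pre-trained model
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):                        # fit only the linear head
        for x, y in ood_train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feats = encoder(x)                 # features without gradients
            loss = loss_fn(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    correct = total = 0                            # evaluate on held-out OOD data
    with torch.no_grad():
        for x, y in ood_test_loader:
            x, y = x.to(device), y.to(device)
            preds = head(encoder(x)).argmax(dim=-1)
            correct += (preds == y).sum().item()
            total += y.numel()
    return correct / total
```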
This list is automatically generated from the titles and abstracts of the papers on this site.