How Effective are State Space Models for Machine Translation?
- URL: http://arxiv.org/abs/2407.05489v1
- Date: Sun, 7 Jul 2024 20:21:49 GMT
- Title: How Effective are State Space Models for Machine Translation?
- Authors: Hugo Pitorro, Pavlo Vasylenko, Marcos Treviso, André F. T. Martins
- Abstract summary: Transformers are the current architecture of choice for NLP, but their attention layers do not scale well to long contexts.
Recent works propose to replace attention with linear recurrent layers.
It remains unclear whether these models are competitive with transformers in machine translation.
- Score: 19.509486069758495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are the current architecture of choice for NLP, but their attention layers do not scale well to long contexts. Recent works propose to replace attention with linear recurrent layers -- this is the case for state space models, which enjoy efficient training and inference. However, it remains unclear whether these models are competitive with transformers in machine translation (MT). In this paper, we provide a rigorous and comprehensive experimental comparison between transformers and linear recurrent models for MT. Concretely, we experiment with RetNet, Mamba, and hybrid versions of Mamba which incorporate attention mechanisms. Our findings demonstrate that Mamba is highly competitive with transformers on sentence- and paragraph-level datasets, where, in the latter, both models benefit from shifting the training distribution towards longer sequences. Further analysis shows that integrating attention into Mamba improves translation quality, robustness to sequence length extrapolation, and the ability to recall named entities.
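To make the abstract's notion of a "linear recurrent layer" concrete, here is a minimal NumPy sketch of a diagonal state space recurrence. It is an illustration only, not the paper's Mamba implementation: Mamba additionally makes the parameters input-dependent ("selective"), and the function name `ssm_scan`, the parameter names A, B, C, and the shapes below are assumptions for the sketch.

```python
# Minimal sketch of a diagonal linear recurrence (SSM-style layer).
# Not the paper's Mamba implementation; names and shapes are illustrative.
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a diagonal state space recurrence over a sequence.

    x: (seq_len, d_in)   input sequence
    A: (d_state,)        diagonal state transition (|A| < 1 for stability)
    B: (d_state, d_in)   input projection
    C: (d_out, d_state)  output projection
    Cost is linear in seq_len and the recurrent state h has constant size,
    unlike attention's O(n^2) pairwise scores.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # one step per token
        h = A * h + B @ x_t       # h_t = A * h_{t-1} + B x_t
        ys.append(C @ h)          # y_t = C h_t
    return np.stack(ys)

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
seq_len, d_in, d_state, d_out = 16, 4, 8, 4
x = rng.normal(size=(seq_len, d_in))
A = 0.9 * np.ones(d_state)                    # stable diagonal transition
B = 0.1 * rng.normal(size=(d_state, d_in))
C = 0.1 * rng.normal(size=(d_out, d_state))
print(ssm_scan(x, A, B, C).shape)             # (16, 4)
```

In training, such recurrences can also be computed with a parallel scan rather than the step-by-step loop above, which is one reason these layers train efficiently.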
Related papers
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show that it is feasible to distill large Transformers into linear RNNs, reusing the linear projection weights from their attention layers, with only academic GPU resources.
The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
- An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Demystify Mamba in Vision: A Linear Attention Perspective [72.93213667713493]
Mamba is an effective state space model with linear computation complexity.
We show that Mamba shares surprising similarities with the linear attention Transformer (see the causal linear-attention sketch after this list).
We propose a Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention.
arXiv Detail & Related papers (2024-05-26T15:31:09Z)
- RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers [2.8554857235549753]
The Transformer architecture's core mechanism, attention, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference.
Mamba, a notable model structure based on state space models, has achieved transformer-equivalent performance in sequence modeling tasks.
We find that Mamba models achieve competitive performance compared to transformer-based models with the same training recipe.
arXiv Detail & Related papers (2024-03-27T06:07:05Z)
- Is Mamba Effective for Time Series Forecasting? [30.85990093479062]
We propose a Mamba-based model named Simple-Mamba (S-Mamba) for time series forecasting.
Specifically, we tokenize the time points of each variate autonomously via a linear layer.
Experiments on thirteen public datasets prove that S-Mamba maintains low computational overhead and achieves leading performance.
arXiv Detail & Related papers (2024-03-17T08:50:44Z)
- The Hidden Attention of Mamba Models [54.50526986788175]
The Mamba layer offers an efficient selective state space model (SSM) that is highly effective in modeling multiple domains.
We show that such models can be viewed as attention-driven models.
This new perspective enables us to empirically and theoretically compare the underlying mechanisms to those of the self-attention layers in transformers.
arXiv Detail & Related papers (2024-03-03T18:58:21Z)
- Is Mamba Capable of In-Context Learning? [63.682741783013306]
State-of-the-art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL).
This work provides empirical evidence that Mamba, a newly proposed state space model, has similar ICL capabilities.
arXiv Detail & Related papers (2024-02-05T16:39:12Z)
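Several of the entries above (the delta-rule, linear-attention-perspective, and RankMamba papers) contrast softmax attention's $O(n^2)$ cost with linear-time alternatives. As a rough illustration of the general idea, not the MLLA or DeltaNet algorithm from those papers, the sketch below shows causal linear attention: a feature map replaces the softmax so each output can be computed from constant-size running sums. The elu+1 feature map and the shapes are assumptions for the sketch.

```python
# Minimal sketch of causal linear attention; an illustration of the general
# technique discussed above, not an implementation from the cited papers.
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1, a common linear-attention choice."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(q, k, v):
    """Causal attention in O(n) time per head.

    q, k: (seq_len, d_k), v: (seq_len, d_v).
    Instead of materialising the (n, n) score matrix, keep running sums
    S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j); each step then costs
    O(d_k * d_v) and memory stays constant in sequence length.
    """
    q, k = phi(q), phi(k)
    d_k, d_v = q.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))        # running sum of phi(k_j) v_j^T
    z = np.zeros(d_k)               # running sum of phi(k_j)
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        S += np.outer(k_t, v_t)
        z += k_t
        out.append((q_t @ S) / (q_t @ z + 1e-6))
    return np.stack(out)

# Tiny usage example.
rng = np.random.default_rng(0)
n, d = 8, 4
y = causal_linear_attention(rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)))
print(y.shape)                      # (8, 4)
```

The delta-rule and MLLA papers cited above refine this basic scheme (e.g. with different state-update rules or gating), but the key property is the same: the recurrent state replaces the quadratic attention matrix.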
This list is automatically generated from the titles and abstracts of the papers on this site.