Multi-Head State Space Model for Speech Recognition
- URL: http://arxiv.org/abs/2305.12498v2
- Date: Thu, 25 May 2023 21:55:58 GMT
- Title: Multi-Head State Space Model for Speech Recognition
- Authors: Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan
Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer,
Mark J. F. Gales
- Abstract summary: State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks.
In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms.
As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus.
- Score: 44.04124537862432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State space models (SSMs) have recently shown promising results on
small-scale sequence and language modelling tasks, rivalling and outperforming
many attention-based approaches. In this paper, we propose a multi-head state
space (MH-SSM) architecture equipped with special gating mechanisms, where
parallel heads are taught to learn local and global temporal dynamics on
sequence data. As a drop-in replacement for multi-head attention in transformer
encoders, this new model significantly outperforms the transformer transducer
on the LibriSpeech speech recognition corpus. Furthermore, we augment the
transformer block with MH-SSMs layers, referred to as the Stateformer,
achieving state-of-the-art performance on the LibriSpeech task, with word error
rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets
without using an external language model.
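To make the architecture concrete, below is a minimal PyTorch sketch of a multi-head state space layer in the spirit of MH-SSM: the input is split across heads, each head runs an independent diagonal linear SSM recurrence, and a learned sigmoid gate modulates the output. This is an illustrative approximation only; the paper's specific gating mechanisms, stacking, and bidirectionality are not reproduced here, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadSSM(nn.Module):
    """Illustrative multi-head state space layer (not the paper's exact design).

    Each head runs an independent diagonal linear SSM,
        s_t = a * s_{t-1} + b * x_t,    y_t = c * s_t,
    and a sigmoid gate mixes the SSM output before the final projection.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Per-head diagonal transition in (0.5, 1) at init, kept stable by a clamp.
        self.log_a = nn.Parameter((0.5 + 0.5 * torch.rand(n_heads, self.d_head)).log())
        self.b = nn.Parameter(0.1 * torch.randn(n_heads, self.d_head))
        self.c = nn.Parameter(0.1 * torch.randn(n_heads, self.d_head))
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        xh = x.reshape(B, T, self.n_heads, self.d_head)   # split into heads
        a = self.log_a.exp().clamp(max=0.999)             # decay per head/channel
        s = x.new_zeros(B, self.n_heads, self.d_head)     # SSM state
        ys = []
        for t in range(T):                                # plain scan: clear, not fast
            s = a * s + self.b * xh[:, t]
            ys.append(self.c * s)
        y = torch.stack(ys, dim=1).reshape(B, T, -1)
        return self.out(torch.sigmoid(self.gate(x)) * y)  # gated output
```

Since the layer maps a (batch, time, d_model) tensor to the same shape, it can stand in for multi-head attention inside a transformer encoder block, which is the drop-in usage the abstract describes.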
Related papers
- Efficient Machine Translation with a BiLSTM-Attention Approach [0.0]
This paper proposes a novel Seq2Seq model aimed at improving translation quality while reducing the storage space required by the model.
The model employs a Bidirectional Long Short-Term Memory network (Bi-LSTM) as the encoder to capture the context information of the input sequence.
Compared to the current mainstream Transformer model, our model achieves superior performance on the WMT14 machine translation dataset.
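As a sketch of the encoder this entry describes, a Bi-LSTM encoder in PyTorch might look as follows; the dimensions and names are illustrative, not taken from the paper.

```python
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bi-LSTM encoder: emits one context vector per source token for
    a downstream attention-equipped decoder to attend over."""

    def __init__(self, vocab_size: int, d_emb: int = 256, d_hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        # bidirectional=True concatenates forward/backward states -> 2 * d_hidden.
        self.lstm = nn.LSTM(d_emb, d_hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens):                  # tokens: (batch, src_len) int64 ids
        outputs, _ = self.lstm(self.embed(tokens))
        return outputs                          # (batch, src_len, 2 * d_hidden)
```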
arXiv Detail & Related papers (2024-10-29T01:12:50Z)
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5 Hz and 60 bps, achieving state-of-the-art (SotA) segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art speech emotion recognition (SER) systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
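For background, ASR and ST transducers share the same skeleton: an acoustic encoder, a label prediction network, and a joint network that scores every (audio frame, label history) pair, with a blank symbol enabling streaming alignment. Below is a generic sketch of the joint network, not LAMASSU's specific design; all names are illustrative.

```python
import torch
import torch.nn as nn

class TransducerJoiner(nn.Module):
    """Generic transducer joint network: scores every (audio frame,
    label history) pair over the output vocabulary plus a blank symbol."""

    def __init__(self, d_enc: int, d_pred: int, d_joint: int, vocab_size: int):
        super().__init__()
        self.proj_enc = nn.Linear(d_enc, d_joint)
        self.proj_pred = nn.Linear(d_pred, d_joint)
        self.out = nn.Linear(d_joint, vocab_size + 1)   # +1 for blank

    def forward(self, enc: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
        # enc: (batch, T, d_enc) acoustic states; pred: (batch, U, d_pred) label states.
        # Broadcast-add over the (T, U) alignment lattice.
        joint = torch.tanh(self.proj_enc(enc).unsqueeze(2) +
                           self.proj_pred(pred).unsqueeze(1))
        return self.out(joint)                          # (batch, T, U, vocab_size + 1)
```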
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Mega: Moving Average Equipped Gated Attention [150.3124713793503]
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average.
We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
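The moving-average component can be pictured as a learnable damped EMA smoothing the sequence before a single-head gated attention reads it. The sketch below is a simplified per-channel version (Mega's actual EMA is multi-dimensional); parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DampedEMA(nn.Module):
    """Per-channel damped exponential moving average over time:
        h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1},
    with learnable alpha, delta squashed into (0, 1)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(d_model))
        self.delta = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, d_model)
        alpha = torch.sigmoid(self.alpha)
        decay = 1.0 - alpha * torch.sigmoid(self.delta)
        h = torch.zeros_like(x[:, 0])
        out = []
        for t in range(x.size(1)):                         # sequential for clarity
            h = alpha * x[:, t] + decay * h
            out.append(h)
        return torch.stack(out, dim=1)                     # smoothed sequence
```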
arXiv Detail & Related papers (2022-09-21T20:52:17Z)
- Relaxed Attention for Transformer Models [29.896876421216373]
In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights.
We show that relaxed attention provides regularization when applied to the self-attention layers in the encoder.
We demonstrate the benefit of relaxed attention across several tasks, with clear improvements when combined with recent benchmark approaches.
arXiv Detail & Related papers (2022-09-20T14:10:28Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition [27.530537066239116]
We introduce the concept of relaxed attention, which is a gradual injection of a uniform distribution to the encoder-decoder attention weights during training.
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art (4.20%) by 13.1% relative.
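The mechanism is simple enough to state in a few lines: each row of the (already normalised) cross-attention matrix is blended with a uniform distribution over the source positions, which keeps rows valid distributions while discouraging over-confident alignments. A minimal sketch, with gamma as the assumed name of the relaxation coefficient:

```python
import torch

def relax_attention(attn_weights: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """Blend softmax attention weights with a uniform distribution.

    attn_weights: (..., tgt_len, src_len) with rows summing to one;
    the output rows still sum to one, since both terms are normalised.
    """
    src_len = attn_weights.size(-1)
    uniform = torch.full_like(attn_weights, 1.0 / src_len)
    return (1.0 - gamma) * attn_weights + gamma * uniform
```

The summary above describes the injection as applied during training, so gamma would be set to 0 (no relaxation) at inference.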
arXiv Detail & Related papers (2021-07-02T21:01:17Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
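Structurally, that amounts to one shared speech encoder feeding two task-specific Transformer decoders. The sketch below shows this skeleton only; the paper's cross-decoder interaction between the two decoders (and attention masking) is omitted, and all names are illustrative.

```python
import torch.nn as nn

class DualDecoderTransformer(nn.Module):
    """One shared speech encoder feeding two task-specific decoders
    (ASR and ST); masks and cross-decoder attention omitted for brevity."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.st_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)

    def forward(self, speech_feats, asr_prev, st_prev):
        # All inputs are already-embedded (batch, len, d_model) tensors.
        memory = self.encoder(speech_feats)            # shared acoustic memory
        return (self.asr_decoder(asr_prev, memory),    # transcription branch
                self.st_decoder(st_prev, memory))      # translation branch
```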
arXiv Detail & Related papers (2020-11-02T04:59:50Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
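The paper adapts Transformer-XL-style relative encodings; a simpler learned-bias variant of the same idea is sketched below, where a bias indexed by the clamped distance between positions is added to the attention logits, so scores depend on relative offset rather than absolute position. Names and the parameterisation are assumptions.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learned bias indexed by clamped distance (j - i), added to attention
    logits so scores depend on relative offset, not absolute position."""

    def __init__(self, n_heads: int, max_dist: int = 128):
        super().__init__()
        self.max_dist = max_dist
        self.bias = nn.Embedding(2 * max_dist + 1, n_heads)

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        pos_q = torch.arange(q_len).unsqueeze(1)        # (q_len, 1)
        pos_k = torch.arange(k_len).unsqueeze(0)        # (1, k_len)
        rel = (pos_k - pos_q).clamp(-self.max_dist, self.max_dist) + self.max_dist
        return self.bias(rel).permute(2, 0, 1)          # (n_heads, q_len, k_len)
```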
arXiv Detail & Related papers (2020-05-20T09:53:06Z)