Augmenting conformers with structured state-space sequence models for
online speech recognition
- URL: http://arxiv.org/abs/2309.08551v2
- Date: Wed, 27 Dec 2023 20:01:07 GMT
- Authors: Haozhe Shan, Albert Gu, Zhong Meng, Weiran Wang, Krzysztof
Choromanski, Tara Sainath
- Abstract summary: Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems.
In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4).
We performed systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions.
Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
- Score: 41.444671189679994
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Online speech recognition, where the model only accesses context to the left,
is an important and challenging use case for ASR systems. In this work, we
investigate augmenting neural encoders for online ASR by incorporating
structured state-space sequence models (S4), a family of models that provide a
parameter-efficient way of accessing arbitrarily long left context. We
performed systematic ablation studies to compare variants of S4 models and
propose two novel approaches that combine them with convolutions. We found that
the most effective design is to stack a small S4 using real-valued recurrent
weights with a local convolution, allowing them to work complementarily. Our
best model achieves WERs of 4.01%/8.53% on test sets from Librispeech,
outperforming Conformers with extensively tuned convolution.
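The design the abstract describes (a small S4 layer with real-valued recurrent weights stacked with a local convolution, so that global left context and local context are modeled complementarily) can be illustrated with a minimal NumPy sketch. This is a hypothetical stand-in, not the paper's implementation: `real_diagonal_ssm`, `causal_depthwise_conv`, and all parameter shapes here are illustrative assumptions.

```python
import numpy as np

def real_diagonal_ssm(x, log_a, B, C):
    """Minimal diagonal state-space layer with real-valued recurrent
    weights (an illustrative stand-in for the paper's S4 variant).

    x:     (T,) input sequence for one channel
    log_a: (N,) log of the diagonal recurrence, so a = exp(log_a) in (0, 1]
    B, C:  (N,) input / output projections
    """
    a = np.exp(log_a)            # real, stable recurrent weights
    s = np.zeros_like(B)         # hidden state carries arbitrarily long left context
    y = np.empty_like(x)
    for t, xt in enumerate(x):   # strictly left-to-right: usable online
        s = a * s + B * xt
        y[t] = C @ s
    return y

def causal_depthwise_conv(x, kernel):
    """Local causal convolution: each output sees only current and past inputs."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

# Stack the two components: the SSM supplies unbounded left context,
# the convolution supplies sharp local context on top of it.
rng = np.random.default_rng(0)
T, N = 32, 8
x = rng.standard_normal(T)
log_a = -np.abs(rng.standard_normal(N))   # keep a = exp(log_a) < 1 for stability
B, C = rng.standard_normal(N), rng.standard_normal(N)
kernel = rng.standard_normal(4)

h = real_diagonal_ssm(x, log_a, B, C)
y = causal_depthwise_conv(h, kernel)
```

Because both components are strictly causal, the stack never reads right context, which is the constraint that defines the online ASR setting in the abstract.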
Related papers
- Multi-Convformer: Extending Conformer with Multiple Convolution Kernels [64.4442240213399]
We introduce Multi-Convformer that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating.
Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient.
We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms, and show up to 8% relative word error rate (WER) improvements.
arXiv Detail & Related papers (2024-07-04T08:08:12Z) - Efficient infusion of self-supervised representations in Automatic Speech Recognition [1.2972104025246092]
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks.
We propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model into the ASR architecture.
Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets.
arXiv Detail & Related papers (2024-04-19T05:01:12Z) - A Neural State-Space Model Approach to Efficient Speech Separation [34.38911304755453]
We introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM).
To extend the SSM technique into speech separation tasks, we first decompose the input mixture into multi-scale representations with different resolutions.
Experiments show that S4M performs comparably to other separation backbones in terms of SI-SDRi.
Our S4M-tiny model (1.8M parameters) even surpasses the attention-based Sepformer (26.0M parameters) in noisy conditions with only 9.2% of the multiply-accumulate operations (MACs).
arXiv Detail & Related papers (2023-05-26T13:47:11Z) - Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z) - Structured State Space Decoder for Speech Recognition and Synthesis [9.354721572095272]
A structured state space model (S4) has been recently proposed, producing promising results for various long-sequence modeling tasks.
In this study, we applied S4 as a decoder for ASR and text-to-speech tasks by comparing it with the Transformer decoder.
For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25%.
arXiv Detail & Related papers (2022-10-31T06:54:23Z) - Liquid Structural State-Space Models [106.74783377913433]
Liquid-S4 achieves an average performance of 87.32% on the Long-Range Arena benchmark.
On the full raw Speech Commands recognition dataset, Liquid-S4 achieves 96.78% accuracy with a 30% reduction in parameter count compared to S4.
arXiv Detail & Related papers (2022-09-26T18:37:13Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks, owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - Heterogeneous Reservoir Computing Models for Persian Speech Recognition [0.0]
Reservoir computing (RC) models have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies.
We propose heterogeneous single and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales.
arXiv Detail & Related papers (2022-05-25T09:15:15Z) - 4-bit Conformer with Native Quantization Aware Training for Speech Recognition [13.997832593421577]
We propose to develop 4-bit ASR models with native quantization aware training, which leverages native integer operations to effectively optimize both training and inference.
We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique.
For the first time, we investigated and revealed the viability of 4-bit quantization on a practical ASR system trained with large-scale datasets.
arXiv Detail & Related papers (2022-03-29T23:57:15Z) - A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.