ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech
Recognition
- URL: http://arxiv.org/abs/2005.10469v1
- Date: Thu, 21 May 2020 05:18:34 GMT
- Title: ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech
Recognition
- Authors: Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei and Tao
Ma
- Abstract summary: We present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures.
In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines.
We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.
- Score: 21.554020483837096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we present state-of-the-art (SOTA) performance on the
LibriSpeech corpus with two novel neural network architectures, a multistream
CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for
language modeling. In the hybrid ASR framework, the multistream CNN acoustic
model processes an input of speech frames in multiple parallel pipelines where
each stream has a unique dilation rate for diversity. Trained with the
SpecAugment data augmentation method, it achieves relative word error rate
(WER) improvements of 4% on test-clean and 14% on test-other. We further
improve the performance via N-best rescoring using a 24-layer self-attentive
SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on
test-other.
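A minimal sketch of the multistream idea described above, assuming a PyTorch setting: several parallel 1-D convolution stacks run over the same sequence of speech frames, each stream with its own dilation rate, and their outputs are concatenated before the output layer. The layer counts, channel sizes, dilation rates and target dimension below are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a multistream CNN acoustic model: parallel dilated-conv
# streams over the same input frames, merged before the per-frame output layer.
# Sizes and dilation rates (1/3/6) are assumptions for illustration.
import torch
import torch.nn as nn

class MultistreamCNN(nn.Module):
    def __init__(self, feat_dim=80, channels=256, num_targets=6000,
                 dilations=(1, 3, 6)):
        super().__init__()
        # one dilated temporal convolution stack per stream
        self.streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(feat_dim, channels, kernel_size=3,
                          dilation=d, padding=d),
                nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=d, padding=d),
                nn.ReLU(),
            )
            for d in dilations
        ])
        self.output = nn.Linear(channels * len(dilations), num_targets)

    def forward(self, frames):                      # frames: (batch, time, feat_dim)
        x = frames.transpose(1, 2)                  # -> (batch, feat_dim, time)
        merged = torch.cat([s(x) for s in self.streams], dim=1)
        return self.output(merged.transpose(1, 2))  # per-frame logits
```

Feeding a (batch, time, 80) tensor of filterbank frames through the module yields per-frame logits; giving each stream a different dilation rate is what provides the "diversity" of temporal receptive fields mentioned in the abstract.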
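The second-pass step can be sketched in a similarly hedged way: N-best rescoring keeps the first-pass hypotheses, adds an external language-model score to each, and picks the best combined score. The `Hypothesis` container, `score_lm` callable and interpolation weight below are assumptions for illustration; in the paper, the 24-layer self-attentive SRU language model would play the role of `score_lm`.

```python
# Hedged sketch of N-best rescoring with an external language model.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_score: float   # first-pass log score for this hypothesis (assumed layout)

def rescore_nbest(nbest, score_lm, lm_weight=0.5):
    """Return the hypothesis with the best combined first-pass + LM score.

    `score_lm(text)` is assumed to return the log-probability of the word
    sequence under the external language model."""
    return max(nbest, key=lambda h: h.am_score + lm_weight * score_lm(h.text))

# usage with a toy stand-in for the LM score
nbest = [Hypothesis("the cat sat", -12.3), Hypothesis("the cat sad", -12.1)]
best = rescore_nbest(nbest, score_lm=lambda t: -len(t.split()))
print(best.text)
```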
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose an approach to joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
We introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
arXiv Detail & Related papers (2021-12-19T17:22:58Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- Multistream CNN for Robust Acoustic Modeling [17.155489701060542]
Multistream CNN is a novel neural network architecture for robust acoustic modeling in speech recognition tasks.
We show consistent improvements against Kaldi's best TDNN-F model across various data sets.
In terms of real-time factor, multistream CNN outperforms the baseline TDNN-F by 15%.
arXiv Detail & Related papers (2020-05-21T05:26:15Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in automatic speech recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WERs of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test-clean/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.