Echo State Speech Recognition
- URL: http://arxiv.org/abs/2102.09114v1
- Date: Thu, 18 Feb 2021 02:04:14 GMT
- Title: Echo State Speech Recognition
- Authors: Harsh Shrivastava, Ankush Garg, Yuan Cao, Yu Zhang, Tara Sainath
- Abstract summary: We propose automatic speech recognition models inspired by echo state networks (ESNs).
We show that model quality does not drop even when the decoder is fully randomized.
Such models can be trained more efficiently since the decoders do not need to be updated.
- Score: 10.084532635965513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose automatic speech recognition (ASR) models inspired by echo state
network (ESN), in which a subset of recurrent neural network (RNN) layers in
the models are randomly initialized and untrained. Our study focuses on RNN-T
and Conformer models, and we show that model quality does not drop even when
the decoder is fully randomized. Furthermore, such models can be trained more
efficiently since the decoders do not need to be updated. By contrast,
randomizing encoders hurts model quality, indicating that optimizing encoders
to learn proper representations of acoustic inputs is more vital for speech
recognition. Overall, we challenge the common practice of training all
components of ASR models and demonstrate that ESN-based models can perform equally
well but enable more efficient training and storage than fully-trainable
counterparts.
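The core ESN idea, freezing randomly initialized recurrent weights and fitting only a linear readout, can be sketched with a toy reservoir in Python. This is a simplification of the paper's RNN-T/Conformer setting; all names, sizes, and the toy task below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res = 3, 50
# Input and recurrent weights are random and stay frozen (never trained).
W_in = rng.normal(scale=0.5, size=(n_res, n_in))
W_res = rng.normal(size=(n_res, n_res))
# Scale recurrent weights so the spectral radius is below 1
# (the "echo state property": fading memory of past inputs).
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))

def run_reservoir(inputs):
    """Drive the fixed reservoir with an input sequence; collect states."""
    h = np.zeros(n_res)
    states = []
    for x in inputs:
        h = np.tanh(W_in @ x + W_res @ h)
        states.append(h)
    return np.stack(states)

# Toy task: only the linear readout W_out is fit by least squares;
# W_in and W_res remain exactly as randomly initialized.
T = 200
inputs = rng.normal(size=(T, n_in))
targets = inputs[:, 0]            # predict the first input channel
H = run_reservoir(inputs)
W_out, *_ = np.linalg.lstsq(H, targets, rcond=None)
pred = H @ W_out
```

Only `W_out` is learned, which is the storage and training saving the abstract points to: the random recurrent weights can be regenerated from a seed rather than stored or updated.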
Related papers
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z) - 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z) - Neural Network Parameter Diffusion [50.85251415173792]
Diffusion models have achieved remarkable success in image and video generation.
In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters.
arXiv Detail & Related papers (2024-02-20T16:59:03Z) - How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z) - Visualising and Explaining Deep Learning Models for Speech Quality
Prediction [0.0]
The non-intrusive speech quality prediction model NISQA is analyzed in this paper.
It is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN).
arXiv Detail & Related papers (2021-12-12T12:50:03Z) - Distributionally Robust Recurrent Decoders with Random Network
Distillation [93.10261573696788]
We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to disregard OOD context during inference.
We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets.
arXiv Detail & Related papers (2021-10-25T19:26:29Z) - Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for
Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Echo State Neural Machine Translation [7.496705711191467]
We present neural machine translation (NMT) models inspired by echo state networks (ESNs), named Echo State NMT (ESNMT).
We show that even with this extremely simple model construction and training procedure, ESNMT can already reach 70-80% quality of fully trainable baselines.
arXiv Detail & Related papers (2020-02-27T00:08:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.