Heterogeneous Reservoir Computing Models for Persian Speech Recognition
- URL: http://arxiv.org/abs/2205.12594v1
- Date: Wed, 25 May 2022 09:15:15 GMT
- Title: Heterogeneous Reservoir Computing Models for Persian Speech Recognition
- Authors: Zohreh Ansari, Farzin Pourhoseini, Fatemeh Hadaeghi
- Abstract summary: Reservoir computing (RC) models have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies.
We propose heterogeneous single and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Over the last decade, deep-learning methods have been gradually incorporated
into conventional automatic speech recognition (ASR) frameworks to create
acoustic, pronunciation, and language models. Although it led to significant
improvements in ASRs' recognition accuracy, due to their hard constraints
related to hardware requirements (e.g., computing power and memory usage), it
is unclear if such approaches are the most computationally- and
energy-efficient options for embedded ASR applications. Reservoir computing
(RC) models (e.g., echo state networks (ESNs) and liquid state machines
(LSMs)), on the other hand, have been proven inexpensive to train, have vastly
fewer parameters, and are compatible with emergent hardware technologies.
However, their performance in speech processing tasks is relatively inferior to
that of the deep-learning-based models. To enhance the accuracy of the RC in
ASR applications, we propose heterogeneous single and multi-layer ESNs to
create non-linear transformations of the inputs that capture temporal context
at different scales. To test our models, we performed a speech recognition task
on the Farsdat Persian dataset. Since, to the best of our knowledge, standard
RC has not yet been employed to conduct any Persian ASR tasks, we also trained
conventional single-layer and deep ESNs to provide baselines for comparison.
Besides, we compared the RC performance with a standard long-short-term memory
(LSTM) model. Heterogeneous RC models (1) show improved performance over the
standard RC models; (2) perform on par with the LSTM in terms of recognition
accuracy; and (3) reduce training time considerably.
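The core idea of a heterogeneous ESN, as described above, is a fixed random reservoir whose units integrate input over different timescales, with only a linear readout trained. The sketch below illustrates this with per-neuron leak rates; all sizes, scaling constants, and the leak-rate range are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res = 13, 200  # e.g. 13 acoustic features per frame (hypothetical sizes)
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
# Scale the spectral radius below 1 so the reservoir has the echo state property.
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# Heterogeneity: per-neuron leak rates spanning several timescales, so
# different units capture temporal context over short vs. long windows.
leak = rng.uniform(0.1, 1.0, n_res)

def run_reservoir(inputs):
    """Collect leaky-integrator reservoir states for a sequence of frames."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        pre = np.tanh(W_in @ u + W @ x)
        x = (1 - leak) * x + leak * pre  # leaky integration, per-neuron rate
        states.append(x.copy())
    return np.array(states)

# Only a linear readout on these states is trained (e.g. ridge regression),
# which is why RC models are cheap to train compared to backprop-through-time.
T = 50
X = run_reservoir(rng.standard_normal((T, n_in)))
print(X.shape)  # (50, 200)
```

Stacking several such reservoirs, each feeding the next, would give the multi-layer variant; the readout can then be trained on the concatenated states of all layers.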
Related papers
- CTC-Assisted LLM-Based Contextual ASR [40.6542391788212]
We propose a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm.
Our model attains WER/B-WER of 1.27%/3.67% and 2.72%/8.02% on the Librispeech test-clean and test-other sets targeting on recognizing rare long-tail words.
arXiv Detail & Related papers (2024-11-10T11:47:50Z) - Efficient infusion of self-supervised representations in Automatic Speech Recognition [1.2972104025246092]
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks.
We propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model into the ASR architecture.
Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets.
arXiv Detail & Related papers (2024-04-19T05:01:12Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Continual Learning for On-Device Speech Recognition using Disentangled
Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, SRU++ can surpass Conformer on long-form speech input with a large margin, based on our analysis.
arXiv Detail & Related papers (2021-10-11T19:23:50Z) - Neural Model Reprogramming with Similarity Based Mapping for
Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z) - A baseline model for computationally inexpensive speech recognition for
Kazakh using the Coqui STT framework [0.0]
We train a new baseline acoustic model and three language models for use with the Coqui STT framework.
Results look promising, but further epochs of training and parameter sweeping are needed to reach a production-level accuracy.
arXiv Detail & Related papers (2021-07-19T14:17:42Z) - SynthASR: Unlocking Synthetic Data for Speech Recognition [15.292920497489925]
We propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training.
In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio improved the recognition performance on the new application by more than 65% relative.
arXiv Detail & Related papers (2021-06-14T23:26:44Z) - Improving RNN Transducer Based ASR with Auxiliary Tasks [21.60022481898402]
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results.
In this work, we examine ways in which recurrent neural network transducer (RNN-T) can achieve better ASR accuracy via performing auxiliary tasks.
arXiv Detail & Related papers (2020-11-05T21:46:32Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.