Leveraging End-to-End Speech Recognition with Neural Architecture Search
- URL: http://arxiv.org/abs/1912.05946v2
- Date: Sat, 20 May 2023 23:27:51 GMT
- Title: Leveraging End-to-End Speech Recognition with Neural Architecture Search
- Authors: Ahmed Baruwa, Mojeed Abisiga, Ibrahim Gbadegesin, Afeez Fakunle
- Abstract summary: We show that a large improvement in the accuracy of deep speech models can be achieved with effective Neural Architecture Optimization.
Our method achieves a test error of 7% Word Error Rate (WER) on the LibriSpeech corpus and 13% Phone Error Rate (PER) on the TIMIT corpus, on par with state-of-the-art results.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep neural networks (DNNs) have been demonstrated to outperform many
traditional machine learning algorithms in Automatic Speech Recognition (ASR).
In this paper, we show that a large improvement in the accuracy of deep speech
models can be achieved with effective Neural Architecture Optimization at a
very low computational cost. Recognition tests on the popular LibriSpeech and
TIMIT benchmarks confirm this: novel candidate models can be discovered and
trained within a few hours (less than a day), many times faster than
attention-based seq2seq models. Our method achieves a test error of 7% Word
Error Rate (WER) on the LibriSpeech corpus and 13% Phone Error Rate (PER) on
the TIMIT corpus, on par with state-of-the-art results.
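For reference, both headline metrics are edit-distance rates: the Levenshtein distance between the hypothesis and reference token sequences, normalized by the reference length (word tokens give WER, phone tokens give PER). A minimal sketch:

```python
def error_rate(reference, hypothesis):
    """Levenshtein distance between token sequences, normalized by
    reference length. With word tokens this is WER; with phone tokens, PER."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub / del / ins
    return dp[len(ref)][len(hyp)] / max(1, len(ref))

print(error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```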
Related papers
- BayesSpeech: A Bayesian Transformer Network for Automatic Speech
Recognition [0.0]
Recent End-to-End deep learning models have been shown to achieve performance near or better than state-of-the-art Recurrent Neural Networks (RNNs) on Automatic Speech Recognition tasks.
We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.
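The summary does not give BayesSpeech's exact variational scheme, so the layer below is only a hypothetical illustration of "variance in the weights": a Bayes-by-backprop-style linear layer that samples its weights with the reparameterization trick (all names and sizes are assumptions, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer whose weights are drawn from a learned Gaussian
    (reparameterization trick), giving a distribution over networks.
    Illustrative only; not the BayesSpeech architecture itself."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -5.0))  # softplus(rho) = stddev
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        std = F.softplus(self.w_rho)
        w = self.w_mu + std * torch.randn_like(std)  # w ~ N(mu, std^2)
        return F.linear(x, w, self.bias)

layer = BayesianLinear(80, 256)
out = layer(torch.randn(4, 80))  # two forward passes give different sampled outputs
```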
arXiv Detail & Related papers (2023-01-16T16:19:04Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows speech to be recognized better in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
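The Efficient Conformer itself is too large to reproduce here, but the core idea, a CTC objective over fused audio and visual feature streams, can be sketched with PyTorch's built-in CTC loss; the concatenation-based fusion and all dimensions below are illustrative assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

B, T, D_audio, D_visual, vocab = 2, 50, 80, 64, 30
audio = torch.randn(B, T, D_audio)    # e.g. filterbank features
visual = torch.randn(B, T, D_visual)  # e.g. lip-region embeddings

fuse = nn.Linear(D_audio + D_visual, vocab)  # toy stand-in for the encoder
log_probs = fuse(torch.cat([audio, visual], dim=-1)).log_softmax(-1)

targets = torch.randint(1, vocab, (B, 12))   # label 0 reserved for the CTC blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
# CTCLoss expects (time, batch, classes) log-probabilities
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```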
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - LongFNT: Long-form Speech Recognition with Factorized Neural Transducer [64.75547712366784]
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
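How the sentence-level features enter the model can be pictured as a late fusion with the vocabulary predictor's output; the concatenate-and-project step below is an assumed simplification for illustration, not the published LongFNT architecture:

```python
import torch
import torch.nn as nn

d_pred, d_long, d_out = 512, 256, 512
predictor_out = torch.randn(4, 10, d_pred)  # per-token vocabulary-predictor states
longform = torch.randn(4, d_long)           # one sentence-level long-form vector

# Broadcast the long-form vector to every token, then fuse by projection.
proj = nn.Linear(d_pred + d_long, d_out)
fused = proj(torch.cat(
    [predictor_out, longform.unsqueeze(1).expand(-1, 10, -1)], dim=-1))
```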
arXiv Detail & Related papers (2022-11-17T08:48:27Z) - Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNNs) with a performance measure that estimates the word error rate from those probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
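The summary does not spell out the performance measure, so the mean per-frame entropy of the DNN's phoneme posteriors below is just one plausible confidence-style proxy, chosen for illustration:

```python
import numpy as np

def mean_posterior_entropy(posteriors):
    """posteriors: (frames, phones) with rows summing to 1. Higher mean
    entropy suggests less confident phoneme evidence, i.e. speech that is
    harder to recognize. A proxy for illustration, not the paper's measure."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

frames = np.random.dirichlet(np.ones(40), size=100)  # fake 40-phone posteriors
print(mean_posterior_entropy(frames))
```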
arXiv Detail & Related papers (2022-03-17T08:05:38Z) - SAFL: A Self-Attention Scene Text Recognizer with Focal Loss [4.462730814123762]
Scene text recognition remains challenging due to inherent problems such as distortions or irregular layout.
Most of the existing approaches mainly leverage recurrence or convolution-based neural networks.
We introduce SAFL, a self-attention-based neural network model with the focal loss for scene text recognition.
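The focal loss itself is standard (Lin et al.): it down-weights well-classified examples by a factor (1 - p_t)^gamma so that training focuses on hard examples. A minimal multi-class version:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t)."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log-prob of true class
    pt = log_pt.exp()
    return ((1 - pt) ** gamma * -log_pt).mean()

# Toy usage: 8 samples, 37 character classes (sizes are illustrative).
loss = focal_loss(torch.randn(8, 37), torch.randint(0, 37, (8,)))
```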
arXiv Detail & Related papers (2022-01-01T06:51:03Z) - StutterNet: Stuttering Detection Using Time Delay Neural Network [9.726119468893721]
This paper introduces StutterNet, a novel deep learning-based stuttering detection system.
We use a time-delay neural network (TDNN) suitable for capturing contextual aspects of the disfluent utterances.
Our method achieves promising results and outperforms the state-of-the-art residual neural network based method.
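A TDNN layer is commonly realized as a dilated 1-D convolution over time, so each successive layer sees a wider temporal context; a minimal sketch (dimensions are illustrative, not StutterNet's):

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """Time-delay layer: a dilated 1-D convolution over the time axis."""
    def __init__(self, d_in, d_out, context=3, dilation=2):
        super().__init__()
        self.conv = nn.Conv1d(d_in, d_out, kernel_size=context, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, features, time)
        return self.act(self.conv(x))

x = torch.randn(4, 40, 200)    # 40 MFCC features over 200 frames
y = TDNNLayer(40, 512)(x)      # each output frame sees context {t-2, t, t+2}
```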
arXiv Detail & Related papers (2021-05-12T11:36:01Z) - Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyperparameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
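DARTS makes the discrete choice of operation differentiable by relaxing it into a softmax-weighted mixture of candidate operations, so the architecture parameters can be learned by gradient descent alongside the (here, LF-MMI) training objective. A minimal mixed-operation sketch, with toy candidate ops:

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """DARTS-style continuous relaxation: the output is a softmax-weighted
    sum of candidate operations; alpha are the architecture parameters."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

candidates = [nn.Linear(64, 64),
              nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
              nn.Identity()]
y = MixedOp(candidates)(torch.randn(8, 64))
# After the search, only the op with the largest alpha is kept.
```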
arXiv Detail & Related papers (2020-07-17T08:32:11Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
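"Stacking the neural cell" can be read as repeating the searched cell to build the final CNN; the toy repetition below assumes a simple fixed-width convolutional cell purely for illustration:

```python
import torch
import torch.nn as nn

def make_cnn(cell_fn, num_cells=8, channels=32):
    """Derive a deep CNN by repeating one searched cell num_cells times
    (a simplified stand-in for AutoSpeech's derived architecture)."""
    layers = [nn.Conv2d(1, channels, 3, padding=1)]  # stem
    layers += [cell_fn(channels) for _ in range(num_cells)]
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten())

# A hypothetical "searched" cell: conv -> batch norm -> ReLU.
cell = lambda c: nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())
model = make_cnn(cell)
emb = model(torch.randn(2, 1, 64, 64))  # (2, 32) utterance-level embedding
```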
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Recognizing Long Grammatical Sequences Using Recurrent Networks
Augmented With An External Differentiable Stack [73.48927855855219]
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state updating mechanisms.
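A differentiable stack replaces hard push/pop with soft, probability-weighted updates (in the spirit of Joulin and Mikolov's stack-augmented RNN), so gradients can flow through memory decisions; a minimal sketch with assumed shapes:

```python
import torch

def soft_stack_update(stack, push_val, a):
    """stack: (depth, d); push_val: (d,); a = (a_push, a_pop, a_noop), summing to 1.
    The new stack is a blend of pushing push_val (shifting everything down),
    popping (exposing stack[1]), and leaving the stack unchanged."""
    a_push, a_pop, a_noop = a
    pushed = torch.cat([push_val.unsqueeze(0), stack[:-1]])       # new value on top
    popped = torch.cat([stack[1:], torch.zeros_like(stack[:1])])  # shift everything up
    return a_push * pushed + a_pop * popped + a_noop * stack

stack = torch.zeros(10, 8)
action = torch.softmax(torch.randn(3), dim=0)  # an RNN controller would emit this
stack = soft_stack_update(stack, torch.randn(8), action)
```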
arXiv Detail & Related papers (2020-04-04T14:19:15Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
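Verification with deep speaker embeddings typically reduces to scoring two utterance embeddings, most simply with cosine similarity against a tuned threshold (PLDA being the common heavier alternative); a minimal sketch:

```python
import numpy as np

def verify(emb_enroll, emb_test, threshold=0.6):
    """Accept the trial if the cosine similarity of the two (already
    extracted) speaker embeddings exceeds the threshold; the threshold
    value here is an arbitrary placeholder."""
    a = emb_enroll / np.linalg.norm(emb_enroll)
    b = emb_test / np.linalg.norm(emb_test)
    score = float(a @ b)
    return score, score > threshold

score, same_speaker = verify(np.random.randn(256), np.random.randn(256))
```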
arXiv Detail & Related papers (2020-02-14T13:34:33Z)