Effects of Number of Filters of Convolutional Layers on Speech
Recognition Model Accuracy
- URL: http://arxiv.org/abs/2102.02326v1
- Date: Wed, 3 Feb 2021 23:04:38 GMT
- Title: Effects of Number of Filters of Convolutional Layers on Speech
Recognition Model Accuracy
- Authors: James Mou, Jun Li
- Abstract summary: This paper studies the effects of the Number of Filters of convolutional layers on the model prediction accuracy of CNN+RNN (Convolutional Neural Networks added to Recurrent Neural Networks) models for ASR (Automatic Speech Recognition).
Experimental results show that adding a CNN to an RNN improves the performance of the CNN+RNN speech recognition model only when the CNN Number of Filters exceeds a certain threshold value.
- Score: 6.2698513174194215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the progress of the End-to-End approach [1], this paper
systematically studies the effects of the Number of Filters of convolutional
layers on the model prediction accuracy of CNN+RNN (Convolutional Neural
Networks added to Recurrent Neural Networks) models for ASR (Automatic Speech
Recognition). Experimental results show that adding a CNN to an RNN improves
the performance of the CNN+RNN speech recognition model only when the CNN
Number of Filters exceeds a certain threshold value; in some parameter ranges,
adding the CNN to the RNN model brings no benefit. Our results show a strong
dependency of word accuracy on the Number of Filters of convolutional layers.
Based on the experimental results, the paper suggests a possible hypothesis of
Sound-2-Vector Embedding (Convolutional Embedding) to explain the above
observations.
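The abstract publishes no code; the following is a minimal sketch of the kind of CNN+RNN acoustic model under study, with the Number of Filters exposed as the swept parameter. The input shape, layer sizes, and filter values are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of a CNN+RNN acoustic model, assuming log-mel spectrogram
# input and per-frame character outputs; all sizes are illustrative.
import torch
import torch.nn as nn

class CnnRnnAcousticModel(nn.Module):
    def __init__(self, num_filters: int, num_mels: int = 80, vocab_size: int = 29):
        super().__init__()
        # The hypothesis under study: below some threshold for `num_filters`,
        # this convolutional front end stops helping the recurrent layers.
        self.conv = nn.Sequential(
            nn.Conv2d(1, num_filters, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(num_filters * (num_mels // 2), 512,
                          num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 512, vocab_size)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, num_mels, time)
        x = self.conv(spectrogram)            # (batch, filters, mels', time)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, time, filters * mels')
        x, _ = self.rnn(x)
        return self.classifier(x)             # per-frame logits

# Sweeping num_filters is how the threshold effect would be probed:
for f in (8, 32, 128):
    model = CnnRnnAcousticModel(num_filters=f)
    print(f, sum(p.numel() for p in model.parameters()))
```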
Based on this Embedding hypothesis and the optimization of parameters, the
paper develops an End-to-End speech recognition system that achieves high word
accuracy with a light model weight. The developed LVCSR (Large Vocabulary
Continuous Speech Recognition) model achieves a word accuracy of 90.2% with its
Acoustic Model alone, without any assistance from an intermediate phonetic
representation or a Language Model. Its acoustic model contains only 4.4
million weight parameters, compared to the 35~68 million acoustic-model weight
parameters in DeepSpeech2 [2] (one of the top state-of-the-art LVCSR models),
which achieves a word accuracy of 91.5%. The lightweight model improves
transcription computing efficiency and is also useful for mobile devices,
driverless vehicles, etc. Our model weight is reduced to ~10% of the size of
DeepSpeech2, while our model accuracy remains close to that of DeepSpeech2.
If combined with a Language Model, our LVCSR system is able to achieve 91.5%
word accuracy.
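The abstract does not say how the Language Model is combined; shallow fusion at decoding time is one common approach, sketched below with a hypothetical weight and made-up scores.

```python
# Hedged sketch of shallow fusion: log-linear combination of the acoustic
# score and an external LM score during beam search. `lm_weight` and the
# example probabilities are illustrative placeholders.
import math

def fused_score(acoustic_logprob: float, lm_logprob: float,
                lm_weight: float = 0.3) -> float:
    """Log-linear combination of acoustic and LM scores."""
    return acoustic_logprob + lm_weight * lm_logprob

# Two beam hypotheses: the LM can flip the ranking of close acoustic scores.
h1 = fused_score(math.log(0.40), math.log(0.05))
h2 = fused_score(math.log(0.35), math.log(0.30))
print("prefer h2" if h2 > h1 else "prefer h1")  # prefer h2
```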
Related papers
- Evaluating raw waveforms with deep learning frameworks for speech
emotion recognition [0.0]
We present a model that feeds raw audio files directly into deep neural networks without any feature extraction stage.
We use six different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS.
The proposed model achieves 90.34% accuracy for EMO-DB with the CNN model, 90.42% for RAVDESS, 99.48% for TESS with the LSTM model, 69.72% for CREMA with the CNN model, and 85.76% for SAVEE with the CNN model.
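A minimal sketch of the "no feature extraction" idea: 1D convolutions applied directly to raw audio samples. Layer sizes and strides are assumptions, not the paper's exact architecture.

```python
# 1D CNN over raw waveform samples; the first conv acts as a learned
# filterbank, replacing hand-crafted features. Sizes are illustrative.
import torch
import torch.nn as nn

raw_audio_model = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=80, stride=16),  # learned filterbank
    nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=3, stride=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                      # collapse time axis
    nn.Flatten(),
    nn.Linear(128, 7),                            # EMO-DB has 7 emotion classes
)

waveform = torch.randn(1, 1, 16000)               # one second at 16 kHz
print(raw_audio_model(waveform).shape)            # torch.Size([1, 7])
```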
arXiv Detail & Related papers (2023-07-06T07:27:59Z)
- Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
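To make the posterior-based idea concrete, here is a hedged sketch that derives an intelligibility-style score from per-frame phoneme posteriors. Using mean posterior entropy as the degradation measure is an illustrative choice, not the paper's actual performance measure.

```python
# Score per-frame phoneme posteriors: noise tends to flatten posteriors,
# raising their entropy, which correlates with recognition errors.
import numpy as np

def posterior_uncertainty(posteriors: np.ndarray) -> float:
    """Mean entropy of (frames, phonemes) posteriors; higher = less certain."""
    eps = 1e-12
    entropy = -(posteriors * np.log(posteriors + eps)).sum(axis=1)
    return float(entropy.mean())

rng = np.random.default_rng(0)
clean = rng.dirichlet(np.full(40, 0.1), size=100)   # peaked posteriors
noisy = rng.dirichlet(np.full(40, 1.0), size=100)   # flattened by noise
print(posterior_uncertainty(clean) < posterior_uncertainty(noisy))  # True
```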
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
- Exploiting Hybrid Models of Tensor-Train Networks for Spoken Command Recognition [9.262289183808035]
This work aims to design a low complexity spoken command recognition (SCR) system.
We exploit a deep hybrid architecture of a tensor-train (TT) network to build an end-to-end SCR pipeline.
Our proposed CNN+(TT-DNN) model attains a competitive accuracy of 96.31% with 4 times fewer model parameters than the CNN model.
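A sketch of the parameter-saving idea behind tensor-train layers: replace one large dense weight with small factors. A real TT layer chains several 4-way cores; this two-factor low-rank version just shows where the compression comes from.

```python
# Compare parameter counts of a dense layer vs. a factored stand-in.
import torch
import torch.nn as nn

in_dim, out_dim, rank = 4096, 4096, 32
dense = nn.Linear(in_dim, out_dim)
factored = nn.Sequential(nn.Linear(in_dim, rank, bias=False),
                         nn.Linear(rank, out_dim))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(factored))   # ~16.8M vs ~0.27M parameters
```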
arXiv Detail & Related papers (2022-01-11T05:57:38Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
We compare estimation accuracy and fidelity of the generated mixed models, statistical models with the roofline model, and a refined roofline model for evaluation.
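A hedged sketch of the stacked-estimation idea: fit a simple per-layer-type latency model from benchmark measurements, then sum per-layer predictions for a whole network. The layer feature (MACs) and the numbers are made up for illustration.

```python
# Fit a least-squares latency model per layer type, then sum predictions.
import numpy as np

def fit_layer_model(features: np.ndarray, latencies_ms: np.ndarray) -> np.ndarray:
    """Least-squares latency model for one layer type on one target device."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add intercept
    coef, *_ = np.linalg.lstsq(X, latencies_ms, rcond=None)
    return coef

# Pretend micro-kernel benchmark: feature = MACs (millions) per conv layer.
macs = np.array([[10.0], [50.0], [200.0]])
measured = np.array([1.2, 4.9, 19.5])
conv_model = fit_layer_model(macs, measured)

# Network-level estimate = sum of per-layer predictions.
network_layers = np.array([[25.0], [25.0], [100.0]])
X = np.hstack([network_layers, np.ones((3, 1))])
print(float((X @ conv_model).sum()))  # total estimated latency (ms)
```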
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
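A sketch of the linear-complexity attention trick: apply a positive feature map to queries and keys and reorder the matrix products so cost grows linearly in sequence length. The feature map here (elu+1) is a simple stand-in for the Performer's random-feature softmax approximation.

```python
# Linear attention: compute k^T v once (O(n)), avoiding the n x n matrix.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    q = F.elu(q) + 1                            # positive features
    k = F.elu(k) + 1
    kv = torch.einsum("bsd,bse->bde", k, v)     # (dim, dim) summary, O(n)
    z = 1.0 / (q @ k.sum(dim=1).unsqueeze(-1))  # normalization, O(n)
    return torch.einsum("bsd,bde->bse", q, kv) * z

q = k = v = torch.randn(2, 1000, 64)
print(linear_attention(q, k, v).shape)          # torch.Size([2, 1000, 64])
```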
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
- DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and testing conditions, due to noise, etc.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
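A minimal sketch of N-best rescoring: re-rank first-pass ASR hypotheses by interpolating the ASR score with a semantic score. The weight and the toy scoring function are illustrative; the paper's semantic model is a trained DNN.

```python
# Re-rank an N-best list with an interpolated semantic score.
def rescore_nbest(nbest, semantic_score, weight=0.5):
    """nbest: list of (hypothesis, asr_logprob). Returns best hypothesis."""
    return max(nbest, key=lambda h: h[1] + weight * semantic_score(h[0]))

# Toy semantic scorer: prefers hypotheses containing an expected keyword.
score = lambda text: 0.0 if "speech" in text else -5.0
nbest = [("recognize peach", -1.0), ("recognize speech", -1.4)]
print(rescore_nbest(nbest, score))  # ('recognize speech', -1.4)
```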
arXiv Detail & Related papers (2020-11-02T13:50:59Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining a good quality performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer- and Convolutional Neural Network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
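A compact sketch of a Conformer block's macaron layout: a half-step feed-forward module, self-attention, a depthwise convolution module, then another half-step feed-forward, each with a residual connection. Internals are simplified versus the paper's block.

```python
# Simplified Conformer block; real blocks add gating, dropout, and
# relative positional encoding in the attention module.
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Conv1d(d, d, kernel_size=31, padding=15, groups=d)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x):                        # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)               # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]            # self-attention + residual
        c = self.conv_norm(x).transpose(1, 2)    # (batch, d, time)
        x = x + self.conv(c).transpose(1, 2)     # depthwise conv module
        x = x + 0.5 * self.ffn2(x)               # second half-step FFN
        return self.out_norm(x)

print(ConformerBlockSketch()(torch.randn(2, 100, 256)).shape)
```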
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
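A hedged sketch of the "derive a CNN by stacking the searched cell" step. The cell below is a fixed conv block standing in for whatever operation combination the search actually selects; depth, widths, and input shape are illustrative.

```python
# Derive a CNN by repeating one cell; the search would choose the cell's ops.
import torch
import torch.nn as nn

def make_cell(channels: int) -> nn.Module:
    # Stand-in for a searched cell (the real one mixes searched operations).
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )

def derive_cnn(num_cells: int, channels: int = 64) -> nn.Module:
    stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
    cells = [make_cell(channels) for _ in range(num_cells)]
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(channels, 1251))  # VoxCeleb1 speaker count
    return nn.Sequential(stem, *cells, head)

model = derive_cnn(num_cells=8)
print(model(torch.randn(1, 1, 64, 300)).shape)  # torch.Size([1, 1251])
```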
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
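A hedged sketch of the second pass: encode a first-pass hypothesis bidirectionally, attend to the acoustic encoder output, and produce a rescoring score. Dimensions and the single-logit scoring head are illustrative simplifications of the LAS-based deliberation decoder.

```python
# Rescore a hypothesis by attending over acoustics, per the two-pass idea.
import torch
import torch.nn as nn

class DeliberationRescorerSketch(nn.Module):
    def __init__(self, d=256, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.hyp_encoder = nn.LSTM(d, d // 2, batch_first=True,
                                   bidirectional=True)   # hypothesis context
        self.attend = nn.MultiheadAttention(d, 4, batch_first=True)
        self.score = nn.Linear(d, 1)

    def forward(self, hyp_tokens, acoustics):
        h, _ = self.hyp_encoder(self.embed(hyp_tokens))  # (batch, hyp_len, d)
        attended, _ = self.attend(h, acoustics, acoustics)
        return self.score(attended.mean(dim=1))          # score per hypothesis

rescorer = DeliberationRescorerSketch()
print(rescorer(torch.randint(0, 1000, (2, 12)),
               torch.randn(2, 80, 256)).shape)           # torch.Size([2, 1])
```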
arXiv Detail & Related papers (2020-03-17T22:01:12Z)