Time-Frequency Localization Using Deep Convolutional Maxout Neural
Network in Persian Speech Recognition
- URL: http://arxiv.org/abs/2108.03818v1
- Date: Mon, 9 Aug 2021 05:46:58 GMT
- Title: Time-Frequency Localization Using Deep Convolutional Maxout Neural
Network in Persian Speech Recognition
- Authors: Arash Dehghani, Seyyed Ali Seyyedsalehi
- Abstract summary: Time-frequency flexibility in the auditory neural systems of some mammals improves recognition performance.
This paper proposes a CNN-based structure for time-frequency localization of audio signal information in the ASR acoustic model.
The average recognition score of TFCMNN models is about 1.6% higher than the average of conventional models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, a CNN-based structure for time-frequency localization of audio
signal information in the ASR acoustic model is proposed for Persian speech
recognition. Research has shown that the time-frequency flexibility of receptive
fields in the auditory neural systems of some mammals improves recognition
performance. Because biological systems achieve high efficiency and performance,
they have inspired many artificial systems, and time-frequency localization has
been used extensively to improve system performance. In recent years, much work
has localized time-frequency information in ASR systems by exploiting the
spatial invariance properties of methods such as TDNN, CNN, and LSTM-RNN.
However, most of these models have large parameter volumes and are challenging
to train. In the structure we have designed, called the Time-Frequency
Convolutional Maxout Neural Network (TFCMNN), two parallel 1D-CMNN blocks, each
with weight sharing along one dimension, are applied simultaneously but
independently to the feature vectors. Their outputs are then concatenated and
fed to a fully connected maxout network for classification. To improve the
performance of this structure, we have used recently developed methods such as
maxout, dropout, and weight normalization. Two experimental sets
were designed and implemented on the Persian FARSDAT speech data set to
evaluate the performance of this model compared to conventional 1D-CMNN models.
According to the experimental results, the average recognition score of TFCMNN
models is about 1.6% higher than that of the conventional models. In addition,
the average training time of the TFCMNN models is about 17 hours shorter than
that of the conventional models. These results agree with prior reports that
time-frequency localization in ASR systems increases accuracy and speeds up
model training.
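The architecture described above can be illustrated with a minimal NumPy sketch: two parallel 1D convolutional maxout blocks (one sharing weights along the time axis, one along the frequency axis), whose outputs are concatenated and passed through a fully connected maxout layer. All layer sizes, kernel widths, and the number of classes below are illustrative assumptions, not values from the paper, and training details (dropout, weight normalization) are omitted.

```python
# Hypothetical sketch of the TFCMNN forward pass, assuming toy dimensions.
import numpy as np

rng = np.random.default_rng(0)

def maxout(x, k):
    """Maxout activation: max over k linear pieces along the last axis."""
    return x.reshape(*x.shape[:-1], x.shape[-1] // k, k).max(axis=-1)

def conv1d_maxout(x, w, k):
    """1D convolution along axis 0 of x (valid padding), then maxout.
    x: (length, channels); w: (kernel, channels, out_features)."""
    kernel = w.shape[0]
    out = np.stack([
        np.tensordot(x[i:i + kernel], w, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0] - kernel + 1)
    ])
    return maxout(out, k)

# Toy input: 20 time frames x 40 frequency bins.
feats = rng.standard_normal((20, 40))
k = 2  # number of maxout pieces

# Time block: convolve along time, weights shared across time positions.
w_t = rng.standard_normal((3, 40, 8 * k)) * 0.1
time_out = conv1d_maxout(feats, w_t, k)       # shape (18, 8)

# Frequency block: convolve along frequency (axis 0 after transpose).
w_f = rng.standard_normal((3, 20, 8 * k)) * 0.1
freq_out = conv1d_maxout(feats.T, w_f, k)     # shape (38, 8)

# Concatenate flattened block outputs, then a fully connected maxout layer.
joint = np.concatenate([time_out.ravel(), freq_out.ravel()])
w_fc = rng.standard_normal((joint.size, 10 * k)) * 0.01
logits = maxout(joint @ w_fc, k)              # 10 hypothetical classes
print(logits.shape)
```

The two blocks see the same feature matrix but convolve along different axes, so each is invariant to shifts in only one dimension; concatenation lets the classifier combine both views, which is the localization idea the abstract describes.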
Related papers
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- Continuous time recurrent neural networks: overview and application to
forecasting blood glucose in the intensive care unit [56.801856519460465]
Continuous time autoregressive recurrent neural networks (CTRNNs) are deep learning models that account for irregular observations.
We demonstrate the application of these models to probabilistic forecasting of blood glucose in a critical care setting.
arXiv Detail & Related papers (2023-04-14T09:39:06Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- TSEM: Temporally Weighted Spatiotemporal Explainable Neural Network for
Multivariate Time Series [0.0]
We present a model-agnostic, model-specific approach to time series deep learning.
We show that TSEM outperforms XCM in terms of accuracy, while also satisfying a number of interpretability criteria.
arXiv Detail & Related papers (2022-05-25T18:54:25Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked
Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
We compare the estimation accuracy and fidelity of the generated mixed models, statistical models with the roofline model, and a refined roofline model for evaluation.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Wireless Localisation in WiFi using Novel Deep Architectures [4.541069830146568]
This paper studies the indoor localisation of WiFi devices based on a commodity chipset and standard channel sounding.
We present a novel shallow neural network (SNN) in which features are extracted from the channel state information corresponding to WiFi subcarriers received on different antennas.
arXiv Detail & Related papers (2020-10-16T22:48:29Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- WaveCRN: An Efficient Convolutional Recurrent Neural Network for
End-to-end Speech Enhancement [31.236720440495994]
In this paper, we propose an efficient E2E SE model, termed WaveCRN.
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU).
In addition, to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
arXiv Detail & Related papers (2020-04-06T13:48:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.