Receptive Field Analysis of Temporal Convolutional Networks for Monaural
Speech Dereverberation
- URL: http://arxiv.org/abs/2204.06439v2
- Date: Thu, 14 Apr 2022 09:10:50 GMT
- Title: Receptive Field Analysis of Temporal Convolutional Networks for Monaural
Speech Dereverberation
- Authors: William Ravenscroft, Stefan Goetze, Thomas Hain
- Abstract summary: Supervised deep learning (DL) models give state-of-the-art performance for single-channel speech dereverberation.
Temporal convolutional networks (TCNs) are commonly used for sequence modelling in speech enhancement tasks.
This paper analyses dereverberation performance depending on the model size and the receptive field of TCNs.
- Score: 26.94528951545861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech dereverberation is often an important requirement in robust speech
processing tasks. Supervised deep learning (DL) models give state-of-the-art
performance for single-channel speech dereverberation. Temporal convolutional
networks (TCNs) are commonly used for sequence modelling in speech enhancement
tasks. A feature of TCNs is that they have a receptive field (RF), dependent on
the specific model configuration, which determines the number of input frames
that can be observed to produce an individual output frame. It has been shown
that TCNs are capable of performing dereverberation of simulated speech data;
however, a thorough analysis, especially with a focus on the RF, is still
lacking in the literature. This paper analyses dereverberation performance
depending on the model size and the RF of TCNs. Experiments using the WHAMR
corpus, which is extended to include room impulse responses (RIRs) with larger
RT60 values, demonstrate that a larger RF can yield significant performance
improvements when training smaller TCN models. It is also demonstrated that
TCNs benefit from a wider RF when dereverberating RIRs with larger RT60 values.
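A TCN's RF follows directly from its kernel size, dilation scheme, and depth. Below is a minimal Python sketch of this calculation for the common Conv-TasNet-style configuration with exponentially increasing dilations; the parameter names and example values are illustrative assumptions, not settings taken from the paper.

```python
def tcn_receptive_field(kernel_size: int, n_blocks: int, n_repeats: int) -> int:
    """Receptive field (in input frames) of a TCN whose convolutional
    blocks use exponentially increasing dilations 1, 2, ..., 2**(n_blocks - 1),
    repeated n_repeats times. Each block widens the RF by
    (kernel_size - 1) * dilation frames.
    """
    rf = 1
    for _ in range(n_repeats):
        for b in range(n_blocks):
            rf += (kernel_size - 1) * 2 ** b
    return rf


# Example: kernel size 3, 8 blocks per repeat, 3 repeats -> RF of 1531 frames.
print(tcn_receptive_field(kernel_size=3, n_blocks=8, n_repeats=3))
```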
Related papers
- Multi-Loss Convolutional Network with Time-Frequency Attention for
Speech Enhancement [16.701596804113553]
We explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement.
Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation.
We propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network.
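The summary above only names the losses; as a rough illustration, the sketch below implements a multi-resolution STFT loss in its widely used form (spectral-convergence plus log-magnitude terms, averaged over several FFT configurations). The specific FFT sizes are assumptions, and the WavLM loss term is omitted.

```python
import torch
import torch.nn.functional as F


def stft_loss(x, y, n_fft, hop, win):
    """Spectral-convergence + log-magnitude loss at one STFT resolution."""
    window = torch.hann_window(win)
    X = torch.stft(x, n_fft, hop, win, window=window, return_complex=True).abs()
    Y = torch.stft(y, n_fft, hop, win, window=window, return_complex=True).abs()
    sc = torch.norm(Y - X, p="fro") / torch.norm(Y, p="fro")
    log_mag = F.l1_loss(torch.log(X + 1e-7), torch.log(Y + 1e-7))
    return sc + log_mag


def multi_resolution_stft_loss(pred, target):
    """Average the single-resolution loss over several (n_fft, hop, win) configs."""
    configs = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]
    return sum(stft_loss(pred, target, *c) for c in configs) / len(configs)
```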
arXiv Detail & Related papers (2023-06-15T08:48:19Z)
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation [26.94528951545861]
Speech separation models are used for isolating individual speakers in many speech processing applications.
Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks.
One such class of models known as temporal convolutional networks (TCNs) has shown promising results for speech separation tasks.
Recent research in speech dereverberation has shown that the optimal RF of a TCN varies with the reverberation characteristics of the speech signal.
arXiv Detail & Related papers (2022-10-27T10:29:19Z)
- Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS [0.0]
We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units).
At the same time, we improve the generality and robustness of our model through a series of data augmentation methods such as Time Warping, Frequency Mask, and Time Mask.
The final experimental results show that the TTS model using only the CNN component can reduce the training time compared to the classic TTS models such as Tacotron.
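Time and frequency masking are simple to implement directly on a spectrogram; the sketch below shows a SpecAugment-style version in NumPy (mask widths are illustrative assumptions, and Time Warping is omitted for brevity).

```python
import numpy as np


def time_and_freq_mask(spec, max_f=8, max_t=20, rng=None):
    """Zero out one random frequency band and one random time span of a
    (freq_bins, time_frames) spectrogram; returns an augmented copy.
    Assumes the spectrogram is larger than the maximum mask sizes."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    f0 = rng.integers(0, spec.shape[0] - max_f)      # frequency mask start
    out[f0:f0 + rng.integers(1, max_f + 1), :] = 0.0
    t0 = rng.integers(0, spec.shape[1] - max_t)      # time mask start
    out[:, t0:t0 + rng.integers(1, max_t + 1)] = 0.0
    return out
```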
arXiv Detail & Related papers (2022-10-24T14:18:43Z)
- VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks.
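At its core, vector quantization maps each continuous state to its nearest codebook entry; the discrete indices are what allow hypotheses to be merged. A minimal PyTorch sketch follows (tensor shapes are assumptions; codebook learning and the straight-through gradient used in training are omitted).

```python
import torch


def vector_quantize(states, codebook):
    """Map each state (batch, dim) to its nearest codebook entry
    (n_codes, dim); returns the quantized states and their indices."""
    dists = torch.cdist(states, codebook)   # (batch, n_codes) L2 distances
    idx = dists.argmin(dim=-1)              # nearest code per state
    return codebook[idx], idx
```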
arXiv Detail & Related papers (2022-08-03T02:45:52Z)
- Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation [26.94528951545861]
A weighted multi-dilation depthwise-separable convolution is proposed to replace standard depthwise-separable convolutions in temporal convolutional networks (TCNs).
It is shown that this weighted multi-dilation temporal convolutional network (WD-TCN) consistently outperforms the TCN across various model configurations.
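As a hedged sketch of the idea, the module below combines several depthwise convolutions with different dilations via learned weights before a pointwise convolution. The static softmax weighting is a simplification (in the paper the weighting is utterance-dependent), and the dilation choices are assumptions.

```python
import torch
import torch.nn as nn


class WeightedMultiDilationDSConv(nn.Module):
    """Depthwise-separable conv whose depthwise stage is a weighted sum
    of branches with different dilations."""

    def __init__(self, channels, kernel_size=3, dilations=(1, 2)):
        super().__init__()
        self.depthwise = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2,
                      dilation=d, groups=channels)
            for d in dilations
        )
        self.branch_logits = nn.Parameter(torch.zeros(len(dilations)))
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):  # x: (batch, channels, frames)
        w = torch.softmax(self.branch_logits, dim=0)
        y = sum(wi * conv(x) for wi, conv in zip(w, self.depthwise))
        return self.pointwise(y)
```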
arXiv Detail & Related papers (2022-05-17T15:56:31Z)
- The Spectral Bias of Polynomial Neural Networks [63.27903166253743]
Polynomial neural networks (PNNs) have been shown to be particularly effective at image generation and face recognition, where high-frequency information is critical.
Previous studies have revealed that neural networks demonstrate a spectral bias towards low-frequency functions, which yields faster learning of low-frequency components during training.
Inspired by such studies, we conduct a spectral analysis of the Neural Tangent Kernel (NTK) of PNNs.
We find that the Π-Net family, i.e., a recently proposed parametrization of PNNs, speeds up the learning of the higher frequencies.
arXiv Detail & Related papers (2022-02-27T23:12:43Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Deep Networks for Direction-of-Arrival Estimation in Low SNR [89.45026632977456]
We introduce a Convolutional Neural Network (CNN) that is trained from multi-channel data of the true array manifold matrix.
We train a CNN in the low-SNR regime to predict DoAs across all SNRs.
Our robust solution can be applied in several fields, ranging from wireless array sensors to acoustic microphones or sonars.
arXiv Detail & Related papers (2020-11-17T12:52:18Z)
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
- Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model improves performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)