Multi-Loss Convolutional Network with Time-Frequency Attention for
Speech Enhancement
- URL: http://arxiv.org/abs/2306.08956v1
- Date: Thu, 15 Jun 2023 08:48:19 GMT
- Title: Multi-Loss Convolutional Network with Time-Frequency Attention for
Speech Enhancement
- Authors: Liang Wan and Hongqing Liu and Yi Zhou and Jie Ji
- Abstract summary: We explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement.
Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation.
We propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network.
- Score: 16.701596804113553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Dual-Path Convolution Recurrent Network (DPCRN) was proposed to
effectively exploit time-frequency domain information. By combining the DPRNN
module with Convolution Recurrent Network (CRN), the DPCRN obtained a promising
performance in speech separation with a limited model size. In this paper, we
explore self-attention in the DPCRN module and design a model called Multi-Loss
Convolutional Network with Time-Frequency Attention (MNTFA) for speech
enhancement. We use self-attention modules to exploit long-term information,
where intra-chunk self-attention is used to model the spectral pattern and
inter-chunk self-attention is used to model the dependence between consecutive
frames. Compared to DPRNN, axial self-attention greatly reduces the memory and
computation required, making it more suitable for the long sequences of speech
signals. In addition, we propose a joint training
long sequences of speech signals. In addition, we propose a joint training
method of a multi-resolution STFT loss and a WavLM loss using a pre-trained
WavLM network. Experiments show that with only 0.23M parameters, the proposed
model achieves better performance than DPCRN.
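The axial self-attention described above can be pictured as two inexpensive attention passes: one along the frequency axis (intra-chunk, spectral pattern) and one along the time axis (inter-chunk, dependence between frames). Below is a minimal PyTorch sketch of this idea; it is not the authors' implementation, and the tensor layout, head count, and normalization placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AxialSelfAttention(nn.Module):
    """Self-attention applied along a single axis of a (B, F, T, C) tensor."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, axis: int) -> torch.Tensor:
        # x: (batch, freq, time, channels); axis=1 -> intra-chunk (spectral
        # pattern), axis=2 -> inter-chunk (dependence between frames).
        B, F, T, C = x.shape
        if axis == 1:
            seq = x.permute(0, 2, 1, 3).reshape(B * T, F, C)
        else:
            seq = x.reshape(B * F, T, C)
        out, _ = self.attn(seq, seq, seq)
        seq = self.norm(seq + out)  # residual connection + LayerNorm
        if axis == 1:
            return seq.reshape(B, T, F, C).permute(0, 2, 1, 3)
        return seq.reshape(B, F, T, C)

# Two axial passes score O(F^2 + T^2) pairs per map instead of the
# O((F*T)^2) that full attention over the flattened spectrogram would need,
# which is the memory/computation saving the abstract refers to.
x = torch.randn(2, 64, 100, 32)       # (batch, freq bins, frames, channels)
layer = AxialSelfAttention(dim=32)
y = layer(layer(x, axis=1), axis=2)   # -> (2, 64, 100, 32)
```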
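The joint training objective can likewise be sketched: a multi-resolution STFT term (spectral convergence plus log-magnitude L1 over several FFT configurations) combined with a feature-matching term computed by a frozen pre-trained WavLM. Loading WavLM via Hugging Face transformers, the choice of last_hidden_state, the three resolutions, and the loss weights are all assumptions here, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from transformers import WavLMModel   # pip install transformers

def stft_mag(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(est, ref,
                               resolutions=((512, 128, 512),
                                            (1024, 256, 1024),
                                            (2048, 512, 2048))):
    # Spectral convergence + log-magnitude L1, averaged over resolutions.
    total = 0.0
    for n_fft, hop, win in resolutions:
        m_est = stft_mag(est, n_fft, hop, win)
        m_ref = stft_mag(ref, n_fft, hop, win)
        sc = torch.norm(m_ref - m_est, p="fro") / torch.norm(m_ref, p="fro")
        mag = F.l1_loss(m_est.log(), m_ref.log())
        total = total + sc + mag
    return total / len(resolutions)

# Frozen pre-trained WavLM used purely as a feature extractor.
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()
for p in wavlm.parameters():
    p.requires_grad_(False)

def wavlm_loss(est, ref):
    # est, ref: (batch, samples) waveforms at 16 kHz.
    return F.l1_loss(wavlm(est).last_hidden_state,
                     wavlm(ref).last_hidden_state)

def joint_loss(est, ref, alpha=1.0, beta=1.0):  # weights are assumptions
    return alpha * multi_resolution_stft_loss(est, ref) + beta * wavlm_loss(est, ref)
```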
Related papers
- Cascaded Temporal Updating Network for Efficient Video Super-Resolution [47.63267159007611]
Key components in recurrent-based VSR networks significantly impact model efficiency.
We propose a cascaded temporal updating network (CTUN) for efficient VSR.
CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods.
arXiv Detail & Related papers (2024-08-26T12:59:32Z)
- Delayed Memory Unit: Modelling Temporal Dependency Through Delay Gate [17.611912733951662]
Recurrent Neural Networks (RNNs) are renowned for their adeptness in modeling temporal dependencies.
We propose a novel Delayed Memory Unit (DMU) in this paper to enhance the temporal modeling capabilities of vanilla RNNs.
Our proposed DMU demonstrates superior temporal modeling capabilities across a broad range of sequential modeling tasks.
arXiv Detail & Related papers (2023-10-23T14:29:48Z)
- Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation [39.64103126881576]
We propose a complex-valued T-F attention (TFA) module that models spectral and temporal dependencies.
We validate the effectiveness of our proposed complex-valued TFA module with the deep complex convolutional recurrent network (DCCRN) using the REVERB challenge corpus.
Experimental findings indicate that integrating our complex-TFA module with DCCRN improves overall speech quality and performance of back-end speech applications.
arXiv Detail & Related papers (2022-11-22T23:38:10Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
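Hybrid CTC/attention training is a well-established recipe: the total loss interpolates a CTC alignment loss with the attention decoder's cross-entropy. A minimal sketch follows; the weight `ctc_weight` and all shapes are illustrative assumptions rather than this paper's settings.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(log_probs, dec_logits, targets,
                              input_lengths, target_lengths,
                              ctc_weight: float = 0.3):
    # log_probs:  (T, N, C) frame-level log-probabilities for the CTC branch
    # dec_logits: (N, U, C) attention-decoder logits, one per output token
    # targets:    (N, U) reference token ids (blank id assumed to be 0)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att

# Toy shapes only; a real system feeds encoder/decoder outputs here.
T, N, C, U = 50, 2, 30, 10
loss = hybrid_ctc_attention_loss(
    torch.randn(T, N, C).log_softmax(-1), torch.randn(N, U, C),
    torch.randint(1, C, (N, U)),
    torch.full((N,), T), torch.full((N,), U))
```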
- Recurrence-in-Recurrence Networks for Video Deblurring [58.49075799159015]
State-of-the-art video deblurring methods often adopt recurrent neural networks to model the temporal dependency between the frames.
In this paper, we propose a recurrence-in-recurrence network architecture to cope with the limitations of short-ranged memory.
arXiv Detail & Related papers (2022-03-12T11:58:13Z)
- MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation [45.90599689005832]
Recently, our proposed recurrent neural network (RNN) based all-deep-learning minimum variance distortionless response (ADL-MVDR) beamformer yielded superior performance over the conventional MVDR.
We present a self-attentive RNN beamformer that further improves our previous RNN-based beamformer by leveraging the powerful modeling capability of self-attention.
arXiv Detail & Related papers (2021-04-17T05:02:04Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
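One common way to realize such a multi-resolution block is parallel 3D convolutions with different temporal kernel sizes whose outputs are concatenated. The sketch below illustrates that pattern only; the kernel sizes and channel split are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MultiTemporalConv(nn.Module):
    """Parallel 3D convs with different temporal kernels, concatenated."""
    def __init__(self, in_ch: int, out_ch: int, t_kernels=(1, 3, 5)):
        super().__init__()
        branch_ch = out_ch // len(t_kernels)
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, branch_ch, kernel_size=(k, 3, 3),
                      padding=(k // 2, 1, 1))  # 'same' padding per branch
            for k in t_kernels
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        return torch.cat([b(x) for b in self.branches], dim=1)

block = MultiTemporalConv(16, 48)
y = block(torch.randn(1, 16, 8, 32, 32))  # -> (1, 48, 8, 32, 32)
```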
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement [31.236720440495994]
In this paper, we propose an efficient E2E SE model, termed WaveCRN.
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRUs).
In addition, to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
arXiv Detail & Related papers (2020-04-06T13:48:05Z)
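A hedged illustration of feature-map masking in this spirit: a recurrent module estimates an element-wise mask that is applied to hidden feature maps rather than to the waveform or spectrogram. The use of an LSTM here (instead of SRUs), the sigmoid range restriction, and all sizes are assumptions for brevity; WaveCRN's exact restriction scheme may differ.

```python
import torch
import torch.nn as nn

class FeatureMasking(nn.Module):
    """Estimates an element-wise mask over hidden feature maps."""
    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(channels, hidden, batch_first=True,
                           bidirectional=True)
        self.proj = nn.Linear(2 * hidden, channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, channels) features from a CNN encoder
        h, _ = self.rnn(feats)
        mask = torch.sigmoid(self.proj(h))  # values in (0, 1) suppress noise
        return mask * feats                 # enhancement in feature space

masker = FeatureMasking(channels=64)
enhanced = masker(torch.randn(2, 100, 64))  # -> (2, 100, 64)
```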
- Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model improves performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)