WaveCRN: An Efficient Convolutional Recurrent Neural Network for
End-to-end Speech Enhancement
- URL: http://arxiv.org/abs/2004.04098v3
- Date: Thu, 26 Nov 2020 07:28:32 GMT
- Title: WaveCRN: An Efficient Convolutional Recurrent Neural Network for
End-to-end Speech Enhancement
- Authors: Tsun-An Hsieh, Hsin-Min Wang, Xugang Lu, and Yu Tsao
- Abstract summary: In this paper, we propose an efficient E2E SE model, termed WaveCRN.
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU).
In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
- Score: 31.236720440495994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the simple design pipeline, end-to-end (E2E) neural models for speech
enhancement (SE) have attracted great interest. To improve the performance of
an E2E model, the locality and temporal sequential properties of speech should
be efficiently taken into account during modeling. However, in
most current E2E models for SE, these properties are either not fully
considered or are too complex to be realized. In this paper, we propose an
efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature
is captured by a convolutional neural network (CNN), while the temporal
sequential property of the locality feature is modeled by stacked simple
recurrent units (SRU). Unlike conventional temporal sequential models based on
long short-term memory (LSTM) networks, which are difficult to parallelize,
SRUs can be efficiently parallelized during computation and require even fewer
model parameters. In addition, in order to more effectively suppress the noise
components in the input noisy speech, we derive a novel restricted feature
masking (RFM) approach that performs enhancement on the feature maps in the
hidden layers; this is different from the approach that applies the estimated
ratio mask on the noisy spectral features, which is commonly used in speech
separation methods. Experimental results on speech denoising and compressed
speech restoration tasks confirm that with the lightweight architecture of SRU
and the feature-mapping-based RFM, WaveCRN performs comparably with other
state-of-the-art approaches with notably reduced model complexity and inference
time.
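To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of a WaveCRN-style enhancer: a 1-D convolutional encoder extracts locality features from the raw waveform, stacked SRUs model their temporal dependencies, and a mask predicted from the SRU output is applied to the hidden feature map before a transposed-convolution decoder reconstructs the waveform. The layer sizes, the sigmoid bounding of the mask, and the module names are illustrative assumptions rather than the authors' exact configuration; the SRU cell follows the standard formulation in which the matrix multiplications depend only on the input, so they can be computed for all time steps in parallel, leaving only a cheap element-wise recurrence.

```python
import torch
import torch.nn as nn


class SRUCell(nn.Module):
    """Minimal simple recurrent unit (SRU). All matrix products depend only on
    the input x_t, so they are computed for every time step at once; the only
    sequential work is the element-wise recurrence over the cell state c_t."""

    def __init__(self, dim):
        super().__init__()
        # a single projection yields the candidate, forget gate, and reset gate
        self.proj = nn.Linear(dim, 3 * dim)

    def forward(self, x):                              # x: (batch, time, dim)
        z, f, r = self.proj(x).chunk(3, dim=-1)        # batched over all time steps
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):                     # element-wise recurrence only
            c = f[:, t] * c + (1 - f[:, t]) * z[:, t]
            h = r[:, t] * torch.tanh(c) + (1 - r[:, t]) * x[:, t]   # highway connection
            outs.append(h)
        return torch.stack(outs, dim=1)


class WaveCRNSketch(nn.Module):
    """Hypothetical WaveCRN-style enhancer: conv encoder -> stacked SRUs ->
    mask applied to the hidden feature map -> transposed-conv decoder.
    Layer sizes and the sigmoid bounding of the mask are illustrative only."""

    def __init__(self, channels=256, kernel=96, stride=48):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride, padding=kernel // 2)
        self.sru = nn.Sequential(SRUCell(channels), SRUCell(channels))
        self.mask = nn.Linear(channels, channels)      # feature-masking head
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride, padding=kernel // 2)

    def forward(self, wav):                            # wav: (batch, 1, samples)
        feat = torch.relu(self.encoder(wav))           # locality features from the CNN
        h = self.sru(feat.transpose(1, 2))             # temporal modelling with SRUs
        m = torch.sigmoid(self.mask(h)).transpose(1, 2)
        masked = feat * m                              # enhance the feature map, not the spectrum
        return self.decoder(masked)                    # back to a waveform


if __name__ == "__main__":
    model = WaveCRNSketch()
    noisy = torch.randn(2, 1, 16000)                   # random noise standing in for noisy speech
    print(model(noisy).shape)                          # length may differ slightly due to striding
```

Running the script prints the shape of the enhanced waveform; the point of the sketch is that, unlike an LSTM, the only sequential work inside the SRU loop is element-wise, which is what makes the recurrent part cheap to parallelize.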
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer architecture.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z) - Multi-Loss Convolutional Network with Time-Frequency Attention for
Speech Enhancement [16.701596804113553]
We explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement.
Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation.
We propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network.
arXiv Detail & Related papers (2023-06-15T08:48:19Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations of off-the-shelf audio neural networks by more than half.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z) - TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z) - Time-Frequency Localization Using Deep Convolutional Maxout Neural
Network in Persian Speech Recognition [0.0]
Time-frequency flexibility in some mammals' auditory neuron systems improves recognition performance.
This paper proposes a CNN-based structure for time-frequency localization of audio signal information in the ASR acoustic model.
The average recognition score of TFCMNN models is about 1.6% higher than the average of conventional models.
arXiv Detail & Related papers (2021-08-09T05:46:58Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z) - Single Channel Speech Enhancement Using Temporal Convolutional Recurrent
Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model is able to improve performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)