End-to-End Complex-Valued Multidilated Convolutional Neural Network for
Joint Acoustic Echo Cancellation and Noise Suppression
- URL: http://arxiv.org/abs/2110.00745v1
- Date: Sat, 2 Oct 2021 07:41:41 GMT
- Title: End-to-End Complex-Valued Multidilated Convolutional Neural Network for
Joint Acoustic Echo Cancellation and Noise Suppression
- Authors: Karn N. Watcharasupat, Thi Ngoc Tho Nguyen, Woon-Seng Gan, Shengkui
Zhao, and Bin Ma
- Abstract summary: In this paper, we exploit the offset-compensating ability of complex time-frequency masks and propose an end-to-end complex-valued neural network architecture.
We also propose a dual-mask technique for joint echo and noise suppression with simultaneous speech enhancement.
- Score: 25.04740291728234
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Echo and noise suppression is an integral part of a full-duplex communication
system. Many recent acoustic echo cancellation (AEC) systems rely on a separate
adaptive filtering module for linear echo suppression and a neural module for
residual echo suppression. However, not only do adaptive filtering modules
require convergence and remain susceptible to changes in acoustic environments,
but this two-stage framework also often introduces unnecessary delays to the
AEC system when neural modules are already capable of both linear and nonlinear
echo suppression. In this paper, we exploit the offset-compensating ability of
complex time-frequency masks and propose an end-to-end complex-valued neural
network architecture. The building block of the proposed model is a
pseudocomplex extension based on the densely-connected multidilated DenseNet
(D3Net) building block, resulting in a very small network of only 354K
parameters. The architecture utilizes the multi-resolution nature of the D3Net
building blocks to eliminate the need for pooling, allowing the network to
extract features using large receptive fields without any loss of output
resolution. We also propose a dual-mask technique for joint echo and noise
suppression with simultaneous speech enhancement. Evaluation on both synthetic
and real test sets demonstrated promising results across multiple energy-based
metrics and perceptual proxies.
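To make the building-block idea more concrete, the following is a minimal sketch, in PyTorch, of a pseudocomplex multidilated dense block: the real and imaginary STFT components are stacked as channels and passed through densely connected dilated convolutions, so the receptive field grows without any pooling or loss of time-frequency resolution. The channel counts, dilation schedule, and the per-layer (rather than per-channel-group) dilation are simplifying assumptions for illustration, not the authors' exact D3Net-based design.

```python
import torch
import torch.nn as nn


class PseudoComplexMultidilatedBlock(nn.Module):
    """Illustrative pseudocomplex multidilated dense block (sizes assumed).

    The real and imaginary parts of the STFT are treated as separate input
    channels and processed by densely connected dilated 2-D convolutions,
    giving a large receptive field while keeping full time-frequency
    resolution (no pooling anywhere).
    """

    def __init__(self, in_channels: int = 2, growth: int = 8,
                 dilations=(1, 2, 4, 8)):
        super().__init__()
        self.convs = nn.ModuleList()
        channels = in_channels
        for d in dilations:
            # 'same'-style padding for a 3x3 kernel with dilation d keeps
            # the spectrogram size unchanged.
            self.convs.append(
                nn.Conv2d(channels, growth, kernel_size=3,
                          padding=d, dilation=d))
            channels += growth  # dense connectivity: each layer sees all earlier features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, freq, time), channel 0 = real part, channel 1 = imaginary part
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return torch.cat(feats, dim=1)
```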
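The dual-mask technique can be sketched as two complex-valued time-frequency masks applied multiplicatively to the microphone STFT, one targeting joint echo and noise suppression and one refining the near-end speech. Because each mask is complex, it can rescale the magnitude and rotate the phase of every bin, which is the offset-compensating property mentioned in the abstract. The function and variable names, and the purely multiplicative combination of the two masks, are assumptions for illustration rather than the paper's exact formulation.

```python
import torch


def apply_dual_complex_masks(mic_stft: torch.Tensor,
                             suppression_mask: torch.Tensor,
                             enhancement_mask: torch.Tensor) -> torch.Tensor:
    """Apply two complex time-frequency masks to the microphone STFT.

    mic_stft:         complex tensor (batch, freq, time) of the microphone signal
    suppression_mask: complex mask for joint echo and noise suppression
    enhancement_mask: complex mask for speech enhancement of the near-end estimate
    """
    # Complex multiplication scales the magnitude and shifts the phase of
    # every bin, which allows compensation of offsets that a real-valued
    # magnitude mask could not correct.
    return mic_stft * suppression_mask * enhancement_mask


if __name__ == "__main__":
    # Toy usage with random complex tensors of matching shape.
    b, f, t = 1, 257, 100
    mic = torch.randn(b, f, t, dtype=torch.cfloat)
    m_sup = torch.randn(b, f, t, dtype=torch.cfloat)
    m_enh = torch.randn(b, f, t, dtype=torch.cfloat)
    near_end = apply_dual_complex_masks(mic, m_sup, m_enh)
    print(near_end.shape)  # torch.Size([1, 257, 100])
```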
Related papers
- Time-Variance Aware Real-Time Speech Enhancement [27.180179632422853]
Current end-to-end deep neural network (DNN) based methods usually model time-variant components implicitly.
We propose a dynamic kernel generation (DKG) module that can be introduced as a learnable plug-in to a DNN-based end-to-end pipeline.
Experimental results verify that the DKG module improves the performance of the model under time-variant scenarios.
arXiv Detail & Related papers (2023-02-25T11:37:35Z)
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The A2A model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis efficiency is also 28% higher than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation scheme can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Deep Residual Echo Suppression with A Tunable Tradeoff Between Signal Distortion and Echo Suppression [13.558688470594676]
A UNet neural network maps the outputs of a linear acoustic echo canceler to the desired signal in the spectral domain.
The system employs 136 thousand parameters, and requires 1.6 giga floating-point operations per second and 10 megabytes of memory.
arXiv Detail & Related papers (2021-06-25T09:49:18Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
The generator is forced to move closer to the departure-from-normality function of real samples, computed in the spectral domain of the Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
- Deep Denoising Neural Network Assisted Compressive Channel Estimation for mmWave Intelligent Reflecting Surfaces [99.34306447202546]
This paper proposes a deep denoising neural network assisted compressive channel estimation for mmWave IRS systems.
We first introduce a hybrid passive/active IRS architecture, where very few receive chains are employed to estimate the uplink user-to-IRS channels.
The complete channel matrix can be reconstructed from the limited measurements based on compressive sensing.
arXiv Detail & Related papers (2020-06-03T12:18:57Z)
- Nonlinear Residual Echo Suppression Based on Multi-stream Conv-TasNet [22.56178941790508]
We propose a residual echo suppression method based on a modification of the fully convolutional time-domain audio separation network (Conv-TasNet).
Both the residual signal of the linear acoustic echo cancellation system and the output of the adaptive filter are adopted to form multiple streams for the Conv-TasNet.
arXiv Detail & Related papers (2020-05-15T16:41:16Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.