DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic
Echo Cancellation, Noise Suppression and Dereverberation
- URL: http://arxiv.org/abs/2306.03177v1
- Date: Mon, 5 Jun 2023 18:37:05 GMT
- Title: DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic
Echo Cancellation, Noise Suppression and Dereverberation
- Authors: Evgenii Indenbom, Nicolae-Catalin Ristea, Ando Saabas, Tanel Parnamaa,
Jegor Guzvin, Ross Cutler
- Abstract summary: This paper proposes a real-time cross-attention deep model named DeepVQE, based on residual convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to simultaneously address AEC, NS, and DR.
We conduct several ablation studies to analyze the contributions of different components of our model to the overall performance.
DeepVQE achieves state-of-the-art performance on non-personalized tracks from the ICASSP 2023 Acoustic Echo Cancellation Challenge and ICASSP 2023 Deep Noise Suppression Challenge test sets, showing that a single model can handle multiple tasks with excellent performance.
- Score: 12.734839065028547
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Acoustic echo cancellation (AEC), noise suppression (NS) and dereverberation
(DR) are an integral part of modern full-duplex communication systems. As the
demand for teleconferencing systems increases, addressing these tasks is
required for an effective and efficient online meeting experience. Most prior
research proposes solutions for these tasks separately, combining them with
digital signal processing (DSP) based components, resulting in complex
pipelines that are often impractical to deploy in real-world applications. This
paper proposes a real-time cross-attention deep model, named DeepVQE, based on
residual convolutional neural networks (CNNs) and recurrent neural networks
(RNNs) to simultaneously address AEC, NS, and DR. We conduct several ablation
studies to analyze the contributions of different components of our model to
the overall performance. DeepVQE achieves state-of-the-art performance on
non-personalized tracks from the ICASSP 2023 Acoustic Echo Cancellation
Challenge and ICASSP 2023 Deep Noise Suppression Challenge test sets, showing
that a single model can handle multiple tasks with excellent performance.
Moreover, the model runs in real-time and has been successfully tested for the
Microsoft Teams platform.
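The abstract describes the DeepVQE architecture only at a high level. As a rough illustration of the core idea, here is a minimal PyTorch sketch combining residual convolutional encoders for the microphone and far-end (loopback) streams, a cross-attention layer in which microphone features attend to far-end features, and a GRU for temporal modeling. Every module name, layer size, and the mask-style output below is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Two 3x3 convs with a skip connection (sizes are illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ELU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(x + self.body(x))

class CrossAttentionFusion(nn.Module):
    """Mic features (queries) attend to far-end features (keys/values)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mic, far):                 # (batch, time, dim)
        fused, _ = self.attn(mic, far, far)
        return mic + fused                       # residual fusion

class ToyDeepVQE(nn.Module):
    def __init__(self, freq_bins: int = 64, dim: int = 128):
        super().__init__()
        self.mic_enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), ResidualConvBlock(8))
        self.far_enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), ResidualConvBlock(8))
        self.mic_proj = nn.Linear(8 * freq_bins, dim)
        self.far_proj = nn.Linear(8 * freq_bins, dim)
        self.fusion = CrossAttentionFusion(dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.mask = nn.Linear(dim, freq_bins)    # per-bin suppression mask

    def forward(self, mic_spec, far_spec):       # (batch, 1, time, freq)
        b, _, t, _ = mic_spec.shape
        mic = self.mic_proj(self.mic_enc(mic_spec).permute(0, 2, 1, 3).reshape(b, t, -1))
        far = self.far_proj(self.far_enc(far_spec).permute(0, 2, 1, 3).reshape(b, t, -1))
        h, _ = self.rnn(self.fusion(mic, far))
        return torch.sigmoid(self.mask(h))       # mask for the mic magnitudes

mic = torch.randn(2, 1, 100, 64)  # mic magnitude spectrogram (illustrative shape)
far = torch.randn(2, 1, 100, 64)  # aligned far-end (loopback) spectrogram
print(ToyDeepVQE()(mic, far).shape)              # torch.Size([2, 100, 64])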
Related papers
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
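The summary above gives no architectural detail, so the following PyTorch fragment is only a loose sketch of one ingredient such a mechanism could use: a cross-modal channel gate in which a pooled embedding from one modality re-weights the channels of the other modality's (frozen) encoder features. The real DG-SCT module also carries spatial and temporal attention; all names and sizes here are assumptions for illustration.

import torch
import torch.nn as nn

class CrossModalChannelGate(nn.Module):
    """Audio-guided channel gate for visual features (illustrative only)."""
    def __init__(self, guide_dim: int, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(guide_dim, channels), nn.Sigmoid())

    def forward(self, feats, guide):
        # feats: (batch, channels, H, W) from a frozen visual encoder
        # guide: (batch, guide_dim) pooled audio embedding
        g = self.gate(guide)[:, :, None, None]   # (batch, channels, 1, 1)
        return feats * g                         # per-channel re-weighting

v = torch.randn(2, 256, 14, 14)   # visual feature map (assumed shape)
a = torch.randn(2, 128)           # pooled audio embedding (assumed size)
print(CrossModalChannelGate(128, 256)(v, a).shape)  # torch.Size([2, 256, 14, 14])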
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation [39.64103126881576]
We propose a complex-valued T-F attention (TFA) module that models spectral and temporal dependencies.
We validate the effectiveness of our proposed complex-valued TFA module with the deep complex convolutional recurrent network (DCCRN) using the REVERB challenge corpus.
Experimental findings indicate that integrating our complex-TFA module with DCCRN improves overall speech quality and performance of back-end speech applications.
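As a rough, real-valued simplification of such a time-frequency attention module (the paper's version is complex-valued), the sketch below applies self-attention once along the time axis and once along the frequency axis of a spectrogram feature map; shapes and head counts are illustrative assumptions.

import torch
import torch.nn as nn

class TimeFreqSelfAttention(nn.Module):
    """Factorized attention over the time axis, then the frequency axis."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                              # x: (batch, time, freq, dim)
        b, t, f, d = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, d)  # attend over time
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, f, t, d).permute(0, 2, 1, 3) + x
        xf = x.reshape(b * t, f, d)                      # attend over frequency
        xf, _ = self.freq_attn(xf, xf, xf)
        return xf.reshape(b, t, f, d) + x

x = torch.randn(2, 50, 32, 64)   # (batch, time, freq, dim), illustrative
print(TimeFreqSelfAttention(64)(x).shape)  # torch.Size([2, 50, 32, 64])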
arXiv Detail & Related papers (2022-11-22T23:38:10Z)
- Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices [59.86658316440461]
We present a robust and low-complexity system for Acoustic Scene Classification (ASC).
We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue.
To further improve performance while keeping the model complexity low, we apply two techniques: an ensemble of multiple spectrograms and channel reduction.
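The inception-residual block itself is not specified in this summary; a generic sketch of the pattern (parallel convolution branches with different kernel sizes, concatenation, a 1x1 channel-reduction convolution, and a residual connection) might look as follows, with branch widths chosen purely for illustration.

import torch
import torch.nn as nn

class InceptionResidualBlock(nn.Module):
    """Parallel multi-scale conv branches with a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        b = channels // 2
        self.branch1 = nn.Conv2d(channels, b, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, b, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, b, kernel_size=5, padding=2)
        self.reduce = nn.Conv2d(3 * b, channels, kernel_size=1)  # channel reduction
        self.act = nn.ReLU()

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.act(x + self.reduce(y))

x = torch.randn(2, 32, 64, 64)   # e.g. a batch of spectrogram feature maps
print(InceptionResidualBlock(32)(x).shape)  # torch.Size([2, 32, 64, 64])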
arXiv Detail & Related papers (2022-03-23T10:27:41Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system that utilizes visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
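LTH-IF is described only at a high level here, so the following is a generic sketch of lottery-ticket-style iterative magnitude pruning with fine-tuning in PyTorch; train_one_epoch is a placeholder, and the pruning fraction and number of rounds are assumptions.

import torch.nn as nn
import torch.nn.utils.prune as prune

def train_one_epoch(model: nn.Module) -> None:
    """Placeholder: fine-tune the surviving weights on the WWS training set."""
    pass

def iterative_prune(model: nn.Module, rounds: int = 5, amount: float = 0.2) -> nn.Module:
    targets = [(m, "weight") for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]
    for _ in range(rounds):
        # Globally remove the smallest-magnitude weights across all layers.
        prune.global_unstructured(targets,
                                  pruning_method=prune.L1Unstructured,
                                  amount=amount)
        train_one_epoch(model)        # fine-tune after each pruning round
    for module, name in targets:
        prune.remove(module, name)    # bake the pruning masks into the weights
    return model

pruned = iterative_prune(nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2)))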
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Compute and memory efficient universal sound source separation [23.152611264259225]
We provide a family of efficient neural network architectures for general purpose audio source separation.
The backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF).
Our experiments show that SuDoRM-RF models perform comparably to, and even surpass, several state-of-the-art benchmarks.
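The summary names the SuDoRM-RF backbone without detail; the sketch below conveys only the successive downsampling-and-resampling idea: cheap depthwise convolutions extract features at progressively coarser time resolutions, which are then resampled back to the input length and aggregated. Depth and channel counts are illustrative assumptions, not the published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SuccessiveDownResample(nn.Module):
    """Multi-resolution features via repeated downsampling, then resampling."""
    def __init__(self, channels: int, depth: int = 4):
        super().__init__()
        self.downs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=5, stride=2,
                      padding=2, groups=channels)     # depthwise, halves length
            for _ in range(depth))

    def forward(self, x):                  # x: (batch, channels, time)
        out, h = torch.zeros_like(x), x
        for down in self.downs:
            h = F.relu(down(h))            # next, coarser resolution
            out = out + F.interpolate(h, size=x.shape[-1], mode="nearest")
        return out

x = torch.randn(2, 64, 1600)   # e.g. encoded audio frames (illustrative shape)
print(SuccessiveDownResample(64)(x).shape)  # torch.Size([2, 64, 1600])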
arXiv Detail & Related papers (2021-03-03T19:16:53Z)
- Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss [49.62291237343537]
We propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network.
With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models.
arXiv Detail & Related papers (2020-10-22T20:14:59Z)
- DD-CNN: Depthwise Disout Convolutional Neural Network for Low-complexity Acoustic Scene Classification [29.343805468175965]
This paper presents a Depthwise Disout Convolutional Neural Network (DD-CNN) for the detection and classification of urban acoustic scenes.
We use log-mel as feature representations of acoustic signals for the inputs of our network.
In the proposed DD-CNN, depthwise separable convolution is used to reduce the network complexity.
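Depthwise separable convolution is a standard complexity-reduction technique: a per-channel (depthwise) spatial convolution followed by a 1x1 (pointwise) convolution replaces a single dense convolution. A minimal sketch with illustrative channel counts:

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch)   # one filter per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # mixes channels

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(2, 32, 40, 100)   # e.g. log-mel feature maps (illustrative shape)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([2, 64, 40, 100])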
arXiv Detail & Related papers (2020-07-25T06:02:20Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments remains one of the most challenging and most in-demand tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.