Adaptive Speech Quality Aware Complex Neural Network for Acoustic Echo
Cancellation with Supervised Contrastive Learning
- URL: http://arxiv.org/abs/2210.16791v2
- Date: Tue, 1 Nov 2022 14:41:34 GMT
- Title: Adaptive Speech Quality Aware Complex Neural Network for Acoustic Echo
Cancellation with Supervised Contrastive Learning
- Authors: Bozhong Liu, Xiaoxi Yu, Hantao Huang
- Abstract summary: Acoustic echo cancellation is designed to remove echoes, reverberation, and unwanted added sounds from the microphone signal.
This paper proposes adaptive speech quality complex neural networks to focus on specific tasks for real-time acoustic echo cancellation.
- Score: 3.1644851830271747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acoustic echo cancellation (AEC) is designed to remove echoes, reverberation,
and unwanted added sounds from the microphone signal while maintaining the
quality of the near-end speaker's speech. This paper proposes adaptive speech
quality complex neural networks to focus on specific tasks for real-time
acoustic echo cancellation. Specifically, we propose a complex modularized
neural network with separate stages that focus on feature extraction, acoustic
separation, and mask optimization, respectively. Furthermore, we adopt the
contrastive learning framework and novel speech quality aware loss functions to
further improve the performance. The model is pre-trained on 72 hours of data
and then fine-tuned on a further 72 hours. The proposed model outperforms the
state of the art.
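The supervised contrastive framework the abstract refers to pulls embeddings of same-class samples together while pushing other classes apart. As a rough illustration only, the standard supervised contrastive loss (Khosla et al., 2020) can be sketched in NumPy; the function name, array shapes, and temperature value are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of a supervised contrastive loss: samples sharing a label are
    positives; all other samples in the batch act as negatives.
    embeddings: (N, D) array, labels: (N,) integer array."""
    # L2-normalise so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature          # (N, N) scaled pairwise similarities
    n = len(labels)
    losses = []
    for i in range(n):
        mask = labels == labels[i]
        mask[i] = False                  # the anchor is never its own positive
        positives = np.where(mask)[0]
        if positives.size == 0:
            continue                     # skip anchors with no positive pair
        others = [a for a in range(n) if a != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        # average -log p(positive | anchor) over all positives of anchor i
        losses.append(np.mean([log_denom - sim[i, p] for p in positives]))
    return float(np.mean(losses))
```

With well-separated classes the loss approaches zero, and it grows when embeddings of different classes collapse together; anchors without any positive in the batch are simply skipped, as in the reference formulation.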
Related papers
- UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit
Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech.
We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement.
Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
arXiv Detail & Related papers (2024-01-26T06:08:47Z) - Deep model with built-in self-attention alignment for acoustic echo
cancellation [1.30661828021882]
We propose a deep learning architecture with built-in self-attention based alignment.
Our approach achieves significant improvements for difficult delay estimation cases on real recordings.
arXiv Detail & Related papers (2022-08-24T05:29:47Z) - End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z) - Improving Speech Enhancement through Fine-Grained Speech Characteristics [42.49874064240742]
We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals.
We first identify key acoustic parameters that have been found to correlate well with voice quality.
We then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features.
arXiv Detail & Related papers (2022-07-01T07:04:28Z) - Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two personalized speech enhancement (PSE) models that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z) - Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot
Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
arXiv Detail & Related papers (2021-05-08T00:42:03Z) - Residual acoustic echo suppression based on efficient multi-task
convolutional neural network [0.0]
We propose a real-time residual acoustic echo suppression (RAES) method using an efficient convolutional neural network.
The training criterion is based on a novel loss function, which we call the suppression loss, to balance the suppression of residual echo against the distortion of near-end signals.
arXiv Detail & Related papers (2020-09-29T11:26:25Z) - Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.