Deep model with built-in self-attention alignment for acoustic echo cancellation
- URL: http://arxiv.org/abs/2208.11308v1
- Date: Wed, 24 Aug 2022 05:29:47 GMT
- Title: Deep model with built-in self-attention alignment for acoustic echo cancellation
- Authors: Evgenii Indenbom, Nicolae-Cătălin Ristea, Ando Saabas, Tanel Pärnamaa, Jegor Gužvin
- Abstract summary: We propose a deep learning architecture with built-in self-attention based alignment.
Our approach achieves significant improvements for difficult delay estimation cases on real recordings.
- Score: 1.30661828021882
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With recent research advances, deep learning models have become an attractive
choice for acoustic echo cancellation (AEC) in real-time teleconferencing
applications. Since acoustic echo is one of the major sources of poor audio
quality, a wide variety of deep models have been proposed. However, an
important but often omitted requirement for good echo cancellation quality is
the synchronization of the microphone and far-end signals. Typically
implemented using classical algorithms based on cross-correlation, the
alignment module is a separate functional block with known design limitations.
In our work we propose a deep learning architecture with built-in
self-attention based alignment, which is able to handle unaligned inputs,
improving echo cancellation performance while simplifying the communication
pipeline. Moreover, we show that our approach achieves significant improvements
for difficult delay estimation cases on real recordings from the AEC Challenge
data set.
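Neither the abstract nor this page includes reference code, so the following are minimal illustrative sketches. First, the classical cross-correlation approach that the abstract identifies as the typical separate alignment block; the function name and the causal-delay windowing are our assumptions.

```python
import numpy as np

def estimate_delay_xcorr(mic: np.ndarray, far_end: np.ndarray, max_delay: int) -> int:
    """Classical cross-correlation delay estimate (the separate alignment
    block the abstract describes). Returns the lag, in samples, at which
    the far-end signal best matches its echo in the microphone signal."""
    corr = np.correlate(mic, far_end, mode="full")
    lags = np.arange(-len(far_end) + 1, len(mic))
    valid = (lags >= 0) & (lags <= max_delay)   # causal, bounded delays only
    return int(lags[valid][np.argmax(np.abs(corr[valid]))])
```

Second, a sketch of what a built-in attention-based alignment layer could look like: microphone frames attend over far-end frames, so the unknown delay is absorbed into soft attention weights instead of an explicit estimate. This is only in the spirit of the paper's design; the class, projections, and shapes are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionAlign(nn.Module):
    """Hypothetical attention-based alignment layer (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # from microphone features
        self.key = nn.Linear(dim, dim)    # from far-end features
        self.scale = dim ** -0.5

    def forward(self, mic_feats, far_feats):
        # mic_feats: (batch, T_mic, dim), far_feats: (batch, T_far, dim)
        q = self.query(mic_feats)
        k = self.key(far_feats)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        # Each mic frame receives a weighted mix of far-end frames,
        # implicitly compensating for the unknown delay.
        return attn @ far_feats
```

In a full model, both feature streams would come from learned encoders, and the soft-aligned far-end features would then be combined with the microphone features before the echo-suppression core.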
Related papers
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
More precisely, the proposed framework consists of two steps: first, estimating audio-only and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
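The DiffSED entry above describes learning to reverse a noising process over event boundaries. As a hedged illustration of that idea, here is a generic DDPM-style training step on (onset, offset) pairs; the function, tensor shapes, and noise-prediction target are our assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, boundaries, alphas_cumprod):
    """One DDPM-style training step on sound-event boundaries (illustrative).

    `boundaries`: (batch, num_events, 2) normalized (onset, offset) pairs.
    `model(noisy, t)` is assumed to predict the added noise, mirroring
    "converting noisy latent queries to ground-truth versions" in spirit."""
    batch = boundaries.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch,))
    noise = torch.randn_like(boundaries)
    a_bar = alphas_cumprod[t].view(batch, 1, 1)
    # Forward (noising) process: interpolate toward pure Gaussian noise.
    noisy = a_bar.sqrt() * boundaries + (1 - a_bar).sqrt() * noise
    # The model learns to reverse the noising process.
    return F.mse_loss(model(noisy, t), noise)
```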
- Adaptive Speech Quality Aware Complex Neural Network for Acoustic Echo Cancellation with Supervised Contrastive Learning [3.1644851830271747]
Acoustic echo cancellation is designed to remove echoes, reverberation, and unwanted added sounds from the microphone signal.
This paper proposes adaptive speech quality complex neural networks to focus on specific tasks for real-time acoustic echo cancellation.
arXiv Detail & Related papers (2022-10-30T09:42:03Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The inversion model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Data Augmentation based Consistency Contrastive Pre-training for Automatic Speech Recognition [18.303072203996347]
Self-supervised acoustic pre-training has achieved remarkable results on the automatic speech recognition (ASR) task.
Most of the successful acoustic pre-training methods use contrastive learning to learn the acoustic representations.
In this letter, we design a novel consistency contrastive learning (CCL) method by utilizing data augmentation for acoustic pre-training.
arXiv Detail & Related papers (2021-12-23T13:23:17Z)
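The consistency contrastive learning (CCL) entry above combines data augmentation with contrastive pre-training. Below is a generic sketch of the contrastive part, an InfoNCE loss over two augmented views of the same batch; the `encoder` and `augment` callables are placeholders, and the paper's specific consistency term is not reproduced.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(encoder, audio, augment, temperature=0.1):
    """Generic contrastive objective on two augmented views (illustrative).

    Each utterance is augmented twice; embeddings of the two views of the
    same utterance are pulled together, all other pairs pushed apart."""
    z1 = F.normalize(encoder(augment(audio)), dim=-1)  # (batch, dim)
    z2 = F.normalize(encoder(augment(audio)), dim=-1)
    logits = z1 @ z2.T / temperature                   # pairwise similarities
    targets = torch.arange(len(audio))                 # positives on diagonal
    return F.cross_entropy(logits, targets)
```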
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Acoustic Structure Inverse Design and Optimization Using Deep Learning [8.574112262676335]
In this work, an acoustic structure design method is proposed based on deep learning.
We experimentally demonstrate the effectiveness of the proposed method.
Our method is more efficient, universal, and automatic, and has a wide range of potential applications.
arXiv Detail & Related papers (2021-01-29T10:43:51Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
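The waveform-domain entry above describes a causal encoder-decoder with skip connections that runs in real time. The toy model below shows only the two properties the summary names, left-only padding for causality and an encoder-to-decoder skip connection; the real model's layer counts, channel sizes, and intermediate sequence model are omitted, and all names here are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalEnhancer(nn.Module):
    """Minimal causal encoder-decoder with a skip connection (illustrative)."""

    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Conv1d(1, ch, kernel_size=8, stride=4)
        self.enc2 = nn.Conv1d(ch, 2 * ch, kernel_size=8, stride=4)
        self.dec2 = nn.ConvTranspose1d(2 * ch, ch, kernel_size=8, stride=4)
        self.dec1 = nn.ConvTranspose1d(ch, 1, kernel_size=8, stride=4)

    def forward(self, wav):                       # wav: (batch, 1, time)
        # Causal: pad on the left only, so no future samples are used.
        x1 = torch.relu(self.enc1(F.pad(wav, (7, 0))))
        x2 = torch.relu(self.enc2(F.pad(x1, (7, 0))))
        y2 = torch.relu(self.dec2(x2))
        y2 = y2[..., : x1.shape[-1]]              # match lengths for the skip
        y1 = self.dec1(y2 + x1)                   # skip connection from encoder
        return y1[..., : wav.shape[-1]]           # trim to input length
```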
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)