Informed Source Extraction With Application to Acoustic Echo Reduction
- URL: http://arxiv.org/abs/2011.04569v4
- Date: Tue, 26 Oct 2021 14:51:15 GMT
- Title: Informed Source Extraction With Application to Acoustic Echo Reduction
- Authors: Mohamed Elminshawi, Wolfgang Mack, and Emanuël A. P. Habets
- Abstract summary: Deep learning methods leverage a speaker-discriminative model that maps a reference snippet uttered by the target speaker into a single embedding vector.
We propose a time-varying source discriminative model that captures the temporal dynamics of the reference signal.
Experimental results demonstrate that the proposed method significantly improves the extraction performance when applied in an acoustic echo reduction scenario.
- Score: 8.296684637620553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Informed speaker extraction aims to extract a target speech signal from a
mixture of sources given prior knowledge about the desired speaker. Recent deep
learning-based methods leverage a speaker discriminative model that maps a
reference snippet uttered by the target speaker into a single embedding vector
that encapsulates the characteristics of the target speaker. However, such
modeling deliberately neglects the time-varying properties of the reference
signal. In this work, we assume that a reference signal is available that is
temporally correlated with the target signal. To take this correlation into
account, we propose a time-varying source discriminative model that captures
the temporal dynamics of the reference signal. We also show that existing
methods and the proposed method can be generalized to non-speech sources as
well. Experimental results demonstrate that the proposed method significantly
improves the extraction performance when applied in an acoustic echo reduction
scenario.
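The abstract contrasts a single, time-invariant speaker embedding with a time-varying one that preserves the temporal dynamics of the reference signal. As a minimal sketch of that distinction only (the function names, feature dimensions, and moving-average pooling below are illustrative assumptions, not the paper's actual model), one can compare collapsing per-frame reference features into one vector against keeping one embedding per frame:

```python
import numpy as np

def static_embedding(ref_frames: np.ndarray) -> np.ndarray:
    """Collapse per-frame reference features into a single embedding
    by temporal averaging (the conventional, time-invariant approach)."""
    return ref_frames.mean(axis=0)

def time_varying_embeddings(ref_frames: np.ndarray, win: int = 3) -> np.ndarray:
    """Keep one embedding per frame (here: a simple moving average over a
    short context window) so temporal dynamics of the reference survive.
    A real model would use a learned network instead of this pooling."""
    T, D = ref_frames.shape
    out = np.empty((T, D))
    for t in range(T):
        lo, hi = max(0, t - win), min(T, t + win + 1)
        out[t] = ref_frames[lo:hi].mean(axis=0)
    return out

rng = np.random.default_rng(0)
ref = rng.standard_normal((100, 16))  # 100 frames of 16-dim reference features

e_static = static_embedding(ref)          # one vector, shape (16,)
e_dynamic = time_varying_embeddings(ref)  # one vector per frame, shape (100, 16)
```

In the echo-reduction setting the paper targets, the reference (far-end) signal is temporally correlated with the echo in the mixture, which is why retaining per-frame structure, as in the second function, can matter.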
Related papers
- A Hybrid Model for Weakly-Supervised Speech Dereverberation [2.731944614640173]
This paper introduces a new training strategy to improve speech dereverberation systems using minimal acoustic information and reverberant (wet) speech.
Experimental results demonstrate that our method achieves more consistent performance across various objective metrics used in speech dereverberation than the state-of-the-art.
arXiv Detail & Related papers (2025-02-06T09:21:22Z)
- Acoustic-based 3D Human Pose Estimation Robust to Human Position [16.0759003139539]
The existing active acoustic sensing-based approach for 3D human pose estimation implicitly assumes that the target user is positioned along a line between loudspeakers and a microphone.
Because reflection and diffraction of sound by the human body cause only subtle acoustic signal changes compared to sound obstruction, the existing model's accuracy degrades significantly when subjects deviate from this line.
To overcome this limitation, we propose a novel method composed of a position discriminator and reverberation-resistant model.
arXiv Detail & Related papers (2023-06-21T14:14:05Z)
- Diffusion Posterior Sampling for Informed Single-Channel Dereverberation [15.16865739526702]
We present an informed single-channel dereverberation method based on conditional generation with diffusion models.
With knowledge of the room impulse response, the anechoic utterance is generated via reverse diffusion.
The proposed approach is substantially more robust to measurement noise than a state-of-the-art informed single-channel dereverberation method.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different utterances.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Joint speaker diarisation and tracking in switching state-space model [51.58295550366401]
This paper proposes to explicitly track the movements of speakers while jointly performing diarisation within a unified model.
A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers.
Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.
arXiv Detail & Related papers (2021-09-23T04:43:58Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement method (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Personalized Keyphrase Detection using Speaker and Environment Information [24.766475943042202]
We introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary.
The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model.
arXiv Detail & Related papers (2021-04-28T18:50:19Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias [59.788358876316295]
We propose a pipeline solution to improve speaker verification on a small actual forensic field dataset.
By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed for teacher-student learning.
We show that the proposed objective function can efficiently improve the performance of teacher-student learning on short utterances.
arXiv Detail & Related papers (2020-09-21T00:58:40Z)
- Dereverberation using joint estimation of dry speech signal and acoustic system [3.5131188669634885]
Speech dereverberation aims to remove quality-degrading effects of a time-invariant impulse response filter from the signal.
In this report, we describe an approach to speech dereverberation that involves joint estimation of the dry speech signal and of the room impulse response.
arXiv Detail & Related papers (2020-07-24T15:33:08Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.