Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments
- URL: http://arxiv.org/abs/2502.16611v2
- Date: Tue, 17 Jun 2025 06:10:07 GMT
- Title: Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments
- Authors: Shitong Xu, Yiyuan Yang, Niki Trigoni, Andrew Markham
- Abstract summary: Previous research has explored extracting the target speaker's characteristics from noisy enrollments. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment. Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works.
- Score: 34.67934887761352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in the monaural target speaker extraction conditioned on noisy enrollments.
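The SI-SNRi figure quoted in the abstract can be computed with the standard scale-invariant SNR definition. The sketch below is that textbook metric, not code from the paper: the estimate is projected onto the reference to remove scale, and the improvement is measured relative to the unprocessed mixture.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    estimate = estimate - estimate.mean()           # zero-mean both signals
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled target component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target                   # residual distortion
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(estimate: np.ndarray, mixture: np.ndarray,
                       target: np.ndarray) -> float:
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because of the projection step, rescaling the estimate by any constant leaves the value unchanged, which is why SI-SNR is preferred over plain SNR for separation systems whose output gain is arbitrary.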
Related papers
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z) - Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is somewhat unrelated to target-speaker task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
arXiv Detail & Related papers (2024-10-15T03:58:13Z) - Look Once to Hear: Target Speech Hearing with Noisy Examples [14.386776139261004]
In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound.
We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing that suppresses all interfering speech and noise except the target speaker.
arXiv Detail & Related papers (2024-05-10T07:44:18Z) - Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z) - In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z) - Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation [14.013049471563141]
Single channel target speaker separation aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker.
A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings.
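The upstream/downstream split described above can be sketched minimally as follows. Both models are stand-ins built from untrained random projections, and all shapes, the mean-pooling, and the sigmoid-mask design are illustrative assumptions, not the architecture of any cited paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def upstream_embedding(enroll_feats: np.ndarray, emb_dim: int = 32) -> np.ndarray:
    """Upstream model stand-in: pool enrollment frames into a fixed-size speaker embedding."""
    proj = rng.standard_normal((enroll_feats.shape[1], emb_dim)) / np.sqrt(enroll_feats.shape[1])
    emb = np.tanh(enroll_feats.mean(axis=0) @ proj)   # temporal mean pooling + projection
    return emb / (np.linalg.norm(emb) + 1e-8)         # unit-normalize the embedding

def downstream_separate(mix_feats: np.ndarray, emb: np.ndarray) -> np.ndarray:
    """Downstream model stand-in: predict a per-frame mask conditioned on the embedding."""
    # Condition by tiling the embedding across frames and concatenating.
    cond = np.concatenate(
        [mix_feats, np.broadcast_to(emb, (mix_feats.shape[0], emb.shape[0]))], axis=1)
    w = rng.standard_normal((cond.shape[1], mix_feats.shape[1])) / np.sqrt(cond.shape[1])
    mask = 1.0 / (1.0 + np.exp(-(cond @ w)))          # sigmoid mask in (0, 1)
    return mask * mix_feats                            # masked mixture = extracted speech

# Illustrative shapes: 50 enrollment frames and 200 mixture frames of 40-dim features.
enroll = rng.standard_normal((50, 40))
mixture = rng.standard_normal((200, 40))
extracted = downstream_separate(mixture, upstream_embedding(enroll))
```

The point of the sketch is the interface: the downstream separator never sees raw enrollment audio, only the fixed-size embedding, which is what makes the choice of upstream extractor (the question the paper above studies) consequential.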
arXiv Detail & Related papers (2022-10-23T07:08:46Z) - Speaker Extraction with Co-Speech Gestures Cue [79.91394239104908]
We explore the use of co-speech gesture sequences, e.g. hand and body movements, as the speaker cue for speaker extraction.
We propose two networks that use the co-speech gesture cue to perform attentive listening to the target speaker.
The experimental results show that the co-speech gesture cue is informative for identifying the target speaker, and the quality of the extracted speech shows significant improvements over the unprocessed mixture.
arXiv Detail & Related papers (2022-03-31T06:48:52Z) - Speaker Generation [16.035697779803627]
This work explores the task of synthesizing speech in nonexistent human-sounding voices.
We present TacoSpawn, a system that performs competitively at this task.
arXiv Detail & Related papers (2021-11-07T22:31:41Z) - PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation
Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z) - WASE: Learning When to Attend for Speaker Extraction in Cocktail Party
Environments [21.4128321045702]
In the speaker extraction problem, additional information about the target speaker is found to aid the tracking and extraction of that speaker.
Inspired by the cue of sound onset, we explicitly modeled the onset cue and verified the effectiveness in the speaker extraction task.
From the perspective of tasks, our onset/offset-based model completes the composite task, a complementary combination of speaker extraction and speaker-dependent voice activity detection.
arXiv Detail & Related papers (2021-06-13T14:56:05Z) - Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z) - Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z) - FaceFilter: Audio-visual speech separation using still images [41.97445146257419]
This paper aims to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker.
arXiv Detail & Related papers (2020-05-14T15:42:31Z) - Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.