Improving Target Speaker Extraction with Sparse LDA-transformed Speaker
Embeddings
- URL: http://arxiv.org/abs/2301.06277v1
- Date: Mon, 16 Jan 2023 06:30:48 GMT
- Title: Improving Target Speaker Extraction with Sparse LDA-transformed Speaker
Embeddings
- Authors: Kai Liu, Xucheng Wan, Ziqing Du and Huan Zhou
- Abstract summary: We propose a simplified speaker cue with clear class separability for target speaker extraction.
Our proposal shows up to 9.9% relative improvement in SI-SDRi.
With an SI-SDRi of 19.4 dB and a PESQ of 3.78, our best TSE system significantly outperforms the current SOTA systems.
- Score: 5.4878772986187565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a practical alternative to speech separation, target speaker extraction
(TSE) aims to extract the speech of the desired speaker using an additional
speaker cue derived from that speaker. Its main challenge lies in how to
properly extract and leverage the speaker cue to improve the quality of the
extracted speech. The cue extraction method adopted in the majority of existing TSE studies is
to directly utilize a discriminative speaker embedding extracted from models
pre-trained for speaker verification. Although high speaker discriminability is
the most desirable property for the speaker verification task, we argue that it
may be too sophisticated for TSE. In this study, we propose that a simplified
speaker cue with clear class separability might be preferred for
TSE. To verify our proposal, we introduce several forms of speaker cues,
including naive speaker embeddings (such as x-vector and xi-vector) and new
speaker embeddings produced by a sparse LDA transform. Corresponding TSE models
are built by integrating these speaker cues with SepFormer (a SOTA speech
separation model). The performance of these TSE models is examined on the
benchmark WSJ0-2mix dataset. Experimental results validate the effectiveness
and generalizability of our proposal, showing up to 9.9% relative improvement
in SI-SDRi. Moreover, with an SI-SDRi of 19.4 dB and a PESQ of 3.78, our best
TSE system significantly outperforms the current SOTA systems and offers the
best TSE results reported to date on WSJ0-2mix.
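The recipe behind this cue can be sketched compactly: fit an LDA on labeled speaker embeddings, sparsify the projection, and hand the projected enrollment embedding to the separator in place of the raw embedding. Below is a minimal illustration, not the authors' code: scikit-learn's standard LDA stands in for the paper's sparse LDA, the hard-thresholding step is only one crude way to induce sparsity, and all data are random stand-ins for real x-vectors.

```python
# Minimal sketch (not the authors' code): turning a pre-trained speaker
# embedding (e.g., an x-vector) into an LDA-transformed cue for TSE.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical training material: 512-dim "x-vectors" for 10 speakers.
n_speakers, per_spk, dim = 10, 20, 512
labels = np.repeat(np.arange(n_speakers), per_spk)
centroids = rng.normal(size=(n_speakers, dim))
xvectors = centroids[labels] + 0.3 * rng.normal(size=(labels.size, dim))

# Fit LDA: projects embeddings into at most (n_speakers - 1) dimensions
# chosen to maximize between-class over within-class scatter.
lda = LinearDiscriminantAnalysis(n_components=n_speakers - 1)
lda.fit(xvectors, labels)

# Crude sparsification of the projection matrix (illustrative only):
# zero out small coefficients so each cue dimension depends on few inputs.
W = lda.scalings_[:, : n_speakers - 1]
W_sparse = np.where(np.abs(W) > np.quantile(np.abs(W), 0.9), W, 0.0)

# At extraction time, an enrollment x-vector becomes a compact speaker cue
# that a SepFormer-style TSE model could consume instead of the raw embedding.
enroll_xvector = xvectors[0]
cue = (enroll_xvector - xvectors.mean(axis=0)) @ W_sparse
print(cue.shape)  # (9,) -- far simpler than the 512-dim embedding
```

The point of the projection is class separability in far fewer, simpler dimensions; a SepFormer-based extractor would consume `cue` exactly where it would otherwise consume the full embedding.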
Related papers
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is somewhat unrelated to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
- SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection [7.6732312922460055]
We propose SelectTTS, a novel method that selects appropriate frames from the target speaker's speech and decodes them using frame-level self-supervised learning (SSL) features.
We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker text-to-speech frameworks in both objective and subjective metrics.
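As a rough illustration of the frame-selection idea (our own sketch under assumed mechanics, not the SelectTTS code), each content frame can be swapped for its nearest neighbour in a pool of frame-level SSL features extracted from the target speaker:

```python
# Minimal sketch: frame selection by nearest-neighbour search over
# frame-level SSL features, so speaker identity comes from the selected
# target-speaker frames directly.
import numpy as np

def select_frames(source_feats: np.ndarray, target_pool: np.ndarray) -> np.ndarray:
    """For every source frame, pick the nearest frame from the target pool.

    source_feats: (T, D) SSL features of the content to synthesize.
    target_pool:  (N, D) SSL features extracted from the target speaker.
    Returns the (T, D) sequence of selected target-speaker frames.
    """
    # Squared Euclidean distances between all source and pool frames.
    d2 = ((source_feats[:, None, :] - target_pool[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)          # nearest pool frame per source frame
    return target_pool[idx]          # these frames would go to a vocoder

# Toy usage with random stand-ins for SSL features (e.g., 768-dim).
rng = np.random.default_rng(0)
selected = select_frames(rng.normal(size=(50, 768)), rng.normal(size=(400, 768)))
print(selected.shape)  # (50, 768)
```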
arXiv Detail & Related papers (2024-08-30T17:34:46Z)
- Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications [18.151884620928936]
We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios.
We propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR.
arXiv Detail & Related papers (2024-03-11T10:11:29Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion [5.4878772986187565]
We present an end-to-end TSE model with the proposed loss schemes and a SepFormer backbone.
With an SI-SDRi of 19.4 dB and a PESQ of 3.81, our best system significantly outperforms the current SOTA systems.
arXiv Detail & Related papers (2023-03-09T04:00:29Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
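The serialization idea behind t-SOT can be illustrated in a few lines. This is our reading of the scheme, not the paper's implementation; the <cc> symbol follows the channel-change token described in the t-SOT work:

```python
# Minimal sketch of t-SOT serialization: tokens from overlapping speakers
# are sorted by emission time, and a channel-change token <cc> is inserted
# whenever the virtual output channel switches, giving a single stream that
# a streaming decoder can emit.

def serialize_t_sot(tokens):
    """tokens: list of (time_sec, channel_id, token) tuples."""
    out, prev_ch = [], None
    for _, ch, tok in sorted(tokens):
        if prev_ch is not None and ch != prev_ch:
            out.append("<cc>")  # marks a switch between virtual channels
        out.append(tok)
        prev_ch = ch
    return out

# Two overlapping utterances on virtual channels 0 and 1:
mix = [(0.0, 0, "hello"), (0.4, 1, "hi"), (0.8, 0, "there"), (1.1, 1, "all")]
print(serialize_t_sot(mix))
# ['hello', '<cc>', 'hi', '<cc>', 'there', '<cc>', 'all']
```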
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
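For context, the MAML loop the summary refers to has a compact generic form. The toy below is a first-order MAML sketch on linear regression, where each synthetic task stands in for one speaker's few enrollment samples; nothing here is Meta-TTS-specific:

```python
# First-order MAML sketch (toy stand-in, not Meta-TTS): the inner loop
# adapts on a few support samples, the outer loop updates the shared
# initialization so that one-step adaptation works well on new tasks.
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2 * X.T @ (X @ w - y) / len(y)

def sample_task():
    """A random linear 'speaker': tasks share a prior the meta-init can learn."""
    w_true = np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=3)
    X = rng.normal(size=(10, 3))
    y = X @ w_true
    return (X[:5], y[:5]), (X[5:], y[5:])

w = np.zeros(3)                  # meta-initialization (shared across tasks)
inner_lr, outer_lr = 0.1, 0.01
for _ in range(2000):
    (Xs, ys), (Xq, yq) = sample_task()
    w_adapted = w - inner_lr * loss_grad(w, Xs, ys)      # inner step
    # First-order MAML: outer update uses the query-set gradient at w_adapted.
    w = w - outer_lr * loss_grad(w_adapted, Xq, yq)

# After meta-training, one inner step adapts well to a new task.
(Xs, ys), (Xq, yq) = sample_task()
w_fast = w - inner_lr * loss_grad(w, Xs, ys)
print(float(((Xq @ w_fast - yq) ** 2).mean()))  # query loss after one step
```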
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS, which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Speaker-aware speech-transformer [18.017579835663057]
This paper uses Speech-Transformer (ST) as the study platform to investigate speaker-aware training of E2E models.
The Speaker-Aware Speech-Transformer (SAST) is a standard ST equipped with a speaker attention module (SAM).
arXiv Detail & Related papers (2020-01-02T15:04:08Z)