Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection
- URL: http://arxiv.org/abs/2105.03542v1
- Date: Sat, 8 May 2021 00:15:57 GMT
- Title: Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection
- Authors: Aswin Sivaraman, Minje Kim
- Abstract summary: Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity.
We propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers.
Grouping the training set speakers into non-overlapping, semantically similar groups is non-trivial and ill-defined.
- Score: 25.05285328404576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a novel zero-shot learning approach towards personalized
speech enhancement through the use of a sparsely active ensemble model.
Optimizing speech denoising systems towards a particular test-time speaker can
improve performance and reduce run-time complexity. However, test-time model
adaptation may be challenging if collecting data from the test-time speaker is
not possible. To this end, we propose using an ensemble model wherein each
specialist module denoises noisy utterances from a distinct partition of
training set speakers. The gating module inexpensively estimates test-time
speaker characteristics in the form of an embedding vector and selects the most
appropriate specialist module for denoising the test signal. Grouping the
training set speakers into non-overlapping semantically similar groups is
non-trivial and ill-defined. To do this, we first train a Siamese network using
noisy speech pairs to maximize or minimize the similarity of its output vectors
depending on whether the utterances derive from the same speaker or not. Next,
we perform k-means clustering on the latent space formed by the averaged
embedding vectors per training set speaker. In this way, we designate speaker
groups and train specialist modules optimized around partitions of the complete
training set. Our experiments show that ensemble models made up of low-capacity
specialists can outperform high-capacity generalist models with greater
efficiency and improved adaptation towards unseen test-time speakers.
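
The grouping procedure described in the abstract (a Siamese embedding network trained on noisy pairs, then k-means over the averaged per-speaker embeddings) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the architecture, shapes, and hyperparameters (SiameseEncoder, emb_dim=128, margin=0.5) are assumptions.

```python
# Sketch of the speaker-grouping stage; all names/shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class SiameseEncoder(nn.Module):
    """Maps a noisy spectrogram (batch, time, n_freq) to a unit-norm embedding."""
    def __init__(self, n_freq=257, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_freq, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, x):
        h, _ = self.rnn(x)                     # (batch, time, 256)
        return F.normalize(self.proj(h[:, -1]), dim=-1)

def contrastive_loss(za, zb, same_speaker, margin=0.5):
    """Maximize similarity for same-speaker pairs, minimize it otherwise.
    same_speaker is a bool tensor of shape (batch,)."""
    sim = F.cosine_similarity(za, zb)          # (batch,)
    pos = 1.0 - sim                            # pull same-speaker pairs together
    neg = F.relu(sim - margin)                 # push different speakers apart
    return torch.where(same_speaker, pos, neg).mean()

def group_speakers(per_speaker_embeddings, k):
    """Cluster the averaged per-speaker embeddings into k specialist groups."""
    km = KMeans(n_clusters=k, n_init=10).fit(per_speaker_embeddings)
    return km.labels_, km.cluster_centers_     # group id per speaker, centroids
```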
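At test time, the gating module embeds the noisy utterance, matches it against the stored cluster centroids, and routes the signal to a single specialist. A hedged sketch under the same assumptions (denoise_zero_shot, centroids, and specialists are illustrative names):

```python
# Sketch of zero-shot specialist selection at inference time.
import torch

@torch.no_grad()
def denoise_zero_shot(noisy, gating_encoder, centroids, specialists):
    """noisy: (time, n_freq); centroids: (k, emb_dim), assumed L2-normalized;
    specialists: list of k denoising modules, one per speaker group."""
    z = gating_encoder(noisy.unsqueeze(0))     # (1, emb_dim), unit-norm
    sims = z @ centroids.T                     # cosine similarity to each group
    best = int(sims.argmax(dim=-1))            # most appropriate speaker group
    return specialists[best](noisy.unsqueeze(0)).squeeze(0)
```

Only the selected specialist runs per utterance while the others stay idle, which is the "sparsely active" property the abstract refers to and the reason inference cost stays close to that of a single low-capacity model.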
Related papers
- SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection [7.6732312922460055]
We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features.
We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker text-to-speech frameworks in both objective and subjective metrics.
arXiv Detail & Related papers (2024-08-30T17:34:46Z)
- Personalized Speech Enhancement Without a Separate Speaker Embedding Model [3.907450460692904]
We propose to use the internal representation of the PSE model itself as the speaker embedding.
We show that our approach performs equally well or better than the standard method of using a pre-trained speaker embedding model.
arXiv Detail & Related papers (2024-06-14T11:16:46Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech [37.6839508524855]
Adapting a speech emotion recognition system to a particular speaker is a hard problem, especially with deep neural networks (DNNs).
This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with acoustic patterns similar to those of the speaker in the test set.
We propose three alternative adaptation strategies: unique speaker, oversampling, and weighting approaches.
arXiv Detail & Related papers (2022-01-19T22:14:49Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
arXiv Detail & Related papers (2021-05-08T00:42:03Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)