Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot
Learning with Knowledge Distillation
- URL: http://arxiv.org/abs/2105.03544v1
- Date: Sat, 8 May 2021 00:42:03 GMT
- Authors: Sunwoo Kim and Minje Kim
- Abstract summary: We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to require no clean speech target from the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
- Score: 26.39206098000297
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In realistic speech enhancement settings for end-user devices, we often
encounter only a few speakers and noise types that tend to reoccur in the
specific acoustic environment. We propose a novel personalized speech
enhancement method to adapt a compact denoising model to the test-time
specificity. Our goal in this test-time adaptation is to require no clean
speech target from the test speaker, thus fulfilling the requirement for
zero-shot learning. To compensate for the lack of clean utterances, we employ
the knowledge distillation framework: in place of the missing clean utterance
target, we distill the more advanced denoising results from an overly large
teacher model and use them as the pseudo target to train the small student
model. This zero-shot learning procedure circumvents the need to collect
users' clean speech, a requirement with which users are reluctant to comply
due to privacy concerns and the technical difficulty of recording clean
voice. Experiments on various test-time conditions show that the proposed
personalization method achieves significant performance gains compared to
larger baseline networks trained on large speaker- and noise-agnostic
datasets. In addition, since
the compact personalized models can outperform larger general-purpose models,
we claim that the proposed method performs model compression with no loss of
denoising performance.
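The adaptation loop described above reduces to a few lines of training code. The following is a minimal sketch, not the authors' implementation: the teacher and student modules, the waveform-in/waveform-out interface, the L1 loss, and the optimizer settings are all assumptions.

```python
# Minimal sketch of the zero-shot knowledge-distillation adaptation loop.
# `teacher` and `student` are assumed waveform-to-waveform denoisers;
# none of the hyperparameters below come from the paper.
import itertools
import torch
import torch.nn as nn

def adapt_student(teacher: nn.Module, student: nn.Module,
                  noisy_batches, lr: float = 1e-4, steps: int = 100):
    """Fine-tune the compact student on the test user's noisy recordings,
    regressing onto the large teacher's enhanced output as a pseudo target."""
    teacher.eval()
    student.train()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _, noisy in zip(range(steps), itertools.cycle(noisy_batches)):
        with torch.no_grad():
            pseudo_target = teacher(noisy)   # teacher's denoised estimate
        opt.zero_grad()
        loss = loss_fn(student(noisy), pseudo_target)  # no clean speech used
        loss.backward()
        opt.step()
    return student
```

Because only the teacher's outputs are consumed as pseudo targets, the loop never touches a clean recording from the user, which is what makes the procedure zero-shot.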
Related papers
- Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions [0.0]
We pretrain a long short-term memory (LSTM) encoder using the autoregressive predictive coding (APC) framework.
We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD.
Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models which are more robust to adverse conditions.
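A minimal sketch of the denoising APC objective as summarized above: an LSTM encoder reads noisy features and predicts clean features a few frames ahead. The feature dimension, depth, and prediction shift are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of denoising autoregressive predictive coding (APC).
import torch
import torch.nn as nn

class DenoisingAPC(nn.Module):
    def __init__(self, n_mels=80, hidden=512, shift=3):
        super().__init__()
        self.shift = shift  # how many frames ahead to predict (assumption)
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)

    def loss(self, noisy_feats, clean_feats):
        # noisy_feats, clean_feats: (batch, time, n_mels)
        h, _ = self.encoder(noisy_feats[:, :-self.shift])
        pred = self.head(h)                   # predictions `shift` frames ahead
        target = clean_feats[:, self.shift:]  # denoising variant: clean targets
        return nn.functional.l1_loss(pred, target)
```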
arXiv Detail & Related papers (2023-12-27T15:36:17Z)
- Combating Label Noise With A General Surrogate Model For Sample Selection [84.61367781175984]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically.
We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
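One plausible reading of this filtering step, sketched over precomputed CLIP-style embeddings; the cosine-similarity criterion and the threshold are assumptions, not details from the paper.

```python
# Hedged sketch of surrogate-model sample selection: score each
# (image, assigned label) pair by embedding similarity and flag
# low-similarity pairs as likely label noise.
import torch
import torch.nn.functional as F

def filter_noisy_labels(image_embeds, text_embeds, threshold=0.25):
    # image_embeds: (n, dim), one embedding per sample
    # text_embeds:  (n, dim), embedding of each sample's assigned class prompt
    sim = F.cosine_similarity(image_embeds, text_embeds, dim=-1)
    keep = sim > threshold  # high similarity suggests a clean label
    return keep
```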
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
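One plausible instantiation of such a continuous denoising state, where intermediate targets retain a fraction of the noise and the network is conditioned on the state variable. The conditioning interface, the uniform sampling, and the MSE loss are all assumptions.

```python
# Illustrative training step along a continuous denoising trajectory.
import torch

def continuous_denoising_step(model, clean, noise, opt):
    noisy = clean + noise
    alpha = torch.rand(clean.shape[0], 1)  # sampled state variable in [0, 1]
    target = clean + alpha * noise         # partially denoised target state
    opt.zero_grad()
    est = model(noisy, alpha)              # model conditioned on the state
    loss = torch.nn.functional.mse_loss(est, target)
    loss.backward()
    opt.step()
    return loss.item()
```

A small floor on `alpha` would mirror the reported finding that keeping a little noise in the target benefits enhancement.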
arXiv Detail & Related papers (2023-09-17T13:27:11Z)
- A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech [24.036987059698415]
We propose a training and inference strategy that additionally uses enhanced speech as a target.
Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT (Noisy-target Training), we train various student models by remixing.
Experimental results show that our proposed method outperforms several baselines.
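A rough sketch of the remixing idea, where an existing enhancer's output doubles as the training target and fresh noise is remixed onto it to form the input. The `enhancer` callable and the mixing SNR are assumptions.

```python
# Build a (noisy input, enhanced target) training pair by remixing.
import torch

def make_pair(enhancer, noisy, extra_noise, snr_db=5.0):
    with torch.no_grad():
        target = enhancer(noisy)  # enhanced speech serves as the target
    # Scale the extra noise to the desired SNR relative to the target.
    p_sig = target.pow(2).mean(dim=-1, keepdim=True)
    p_noise = extra_noise.pow(2).mean(dim=-1, keepdim=True).clamp_min(1e-8)
    gain = (p_sig / p_noise / 10 ** (snr_db / 10)).sqrt()
    student_input = target + gain * extra_noise  # remixed training input
    return student_input, target
```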
arXiv Detail & Related papers (2022-10-27T12:26:24Z)
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
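Prompt tuning in this spirit can be sketched generically: a handful of prepended prompt vectors are trained while the backbone stays frozen. The GSLM internals are abstracted behind a generic module here, so the embedding dimension, prompt count, and interface are all assumptions.

```python
# Generic prompt-tuning wrapper: only the prompt vectors receive gradients.
import torch
import torch.nn as nn

class PromptTuned(nn.Module):
    def __init__(self, frozen_model: nn.Module, embed_dim=768, n_prompts=16):
        super().__init__()
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds):  # token_embeds: (batch, time, embed_dim)
        b = token_embeds.shape[0]
        prefix = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.model(torch.cat([prefix, token_embeds], dim=1))
```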
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no further correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification [24.596224536399326]
We train a signal-to-noise ratio (SNR) predictor model to estimate the frame-by-frame SNR of the pseudo-sources.
We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data.
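A minimal sketch of such purification, with the predictor architecture and the dB threshold as assumptions: frames whose estimated SNR falls below the threshold are masked out of the adaptation data.

```python
# Frame-level data purification driven by a learned SNR predictor.
import torch

def purify(snr_predictor, frames, min_snr_db=0.0):
    # frames: (batch, time, feat); predictor returns per-frame SNR in dB
    with torch.no_grad():
        snr = snr_predictor(frames)  # (batch, time)
    keep = snr > min_snr_db          # boolean mask of usable frames
    return frames * keep.unsqueeze(-1).float(), keep
```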
arXiv Detail & Related papers (2021-04-05T17:17:55Z)
- Self-Supervised Learning for Personalized Speech Enhancement [25.05285328404576]
Speech enhancement systems can show improved performance by adapting the model towards a single test-time speaker.
A test-time user might provide only a small amount of noise-free speech data, which is likely insufficient for traditional fully-supervised learning.
We propose self-supervised methods that are designed specifically to learn personalized and discriminative features from abundant in-the-wild noisy, but still personal speech recordings.
arXiv Detail & Related papers (2021-04-05T17:12:51Z)
- Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement [19.645016575334786]
This work explores how self-supervised learning can be universally used to discover speaker-specific features.
We develop a simple contrastive learning procedure which treats the abundant noisy data as makeshift training targets.
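One common recipe matching this description, sketched with an assumed encoder: two mixtures sharing the same base utterance form a positive pair, and the rest of the batch serves as negatives. The temperature and the cross-entropy formulation are assumptions.

```python
# Contrastive loss over premixed pairs, in the spirit described above.
import torch
import torch.nn.functional as F

def contrastive_mixture_loss(encoder, utterance, noise_a, noise_b, temp=0.1):
    # The base utterance may itself be noisy in-the-wild speech; it only
    # needs to be shared within a positive pair.
    za = F.normalize(encoder(utterance + noise_a), dim=-1)  # (batch, dim)
    zb = F.normalize(encoder(utterance + noise_b), dim=-1)
    logits = za @ zb.t() / temp          # pairwise similarities
    labels = torch.arange(za.shape[0])   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```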
arXiv Detail & Related papers (2020-11-06T15:21:00Z)