Self-Supervised Learning from Contrastive Mixtures for Personalized
Speech Enhancement
- URL: http://arxiv.org/abs/2011.03426v2
- Date: Tue, 9 Aug 2022 18:24:58 GMT
- Title: Self-Supervised Learning from Contrastive Mixtures for Personalized
Speech Enhancement
- Authors: Aswin Sivaraman and Minje Kim
- Abstract summary: This work explores how self-supervised learning can be universally used to discover speaker-specific features.
We develop a simple contrastive learning procedure which treats the abundant noisy data as makeshift training targets.
- Score: 19.645016575334786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work explores how self-supervised learning can be universally used to
discover speaker-specific features towards enabling personalized speech
enhancement models. We specifically address the few-shot learning scenario
where access to cleaning recordings of a test-time speaker is limited to a few
seconds, but noisy recordings of the speaker are abundant. We develop a simple
contrastive learning procedure which treats the abundant noisy data as
makeshift training targets through pairwise noise injection: the model is
pretrained to maximize agreement between pairs of differently deformed
identical utterances and to minimize agreement between pairs of similarly
deformed nonidentical utterances. Our experiments compare the proposed
pretraining approach with two baseline alternatives: speaker-agnostic
fully-supervised pretraining, and speaker-specific self-supervised pretraining
without contrastive loss terms. Of all three approaches, the proposed method
using contrastive mixtures is found to be most robust to model compression
(using 85% fewer parameters) and reduced clean speech (requiring only 3
seconds).
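The pairwise noise-injection objective described in the abstract can be sketched as an NT-Xent-style contrastive loss: two differently deformed views of the same utterances form positive pairs, while similarly deformed nonidentical utterances in the batch serve as negatives. The sketch below is a minimal PyTorch illustration; the function name, temperature value, and random toy embeddings are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_mixture_loss(z_a, z_b, temperature=0.1):
    """NT-Xent-style loss over two noisy views of the same utterances.

    z_a, z_b: (N, D) embeddings of the same N utterances under two
    different noise deformations. Positive pairs are (z_a[i], z_b[i]);
    the other rows act as negatives (similarly deformed, nonidentical).
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature      # (N, N) scaled cosine similarities
    targets = torch.arange(z_a.size(0))       # positives lie on the diagonal
    # symmetric cross-entropy: each view must identify its counterpart
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage: random "embeddings" standing in for encoder outputs
torch.manual_seed(0)
z1, z2 = torch.randn(8, 64), torch.randn(8, 64)
loss = contrastive_mixture_loss(z1, z2)
```

Maximizing agreement between the positive pairs while the denominator of the softmax penalizes agreement with the negatives realizes both objectives with a single cross-entropy term.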
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining [0.0]
We revisit the performance comparison between two-stage and end-to-end models.
We find that audio-based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models.
arXiv Detail & Related papers (2023-09-08T17:12:14Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [20.894029832911617]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z)
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial to improving the quality of speech separated from real noisy recordings.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
- Self-supervised Speaker Recognition Training Using Human-Machine Dialogues [22.262550043863445]
We investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices.
We propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity.
Experiments demonstrate that the proposed method provides significant performance improvements, superior to earlier work.
arXiv Detail & Related papers (2022-02-07T19:44:54Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)
- Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
arXiv Detail & Related papers (2021-05-08T00:42:03Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
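The momentum contrastive learning mentioned in the last entry relies on a key encoder that slowly tracks the query encoder via an exponential moving average rather than gradient updates. A minimal MoCo-style sketch of that update (the function name, momentum value, and toy linear "encoders" are illustrative assumptions):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update: key encoder parameters slowly track the query encoder."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

# toy usage: two identical linear layers standing in for speaker encoders
q = torch.nn.Linear(4, 4)
k = torch.nn.Linear(4, 4)
k.load_state_dict(q.state_dict())  # start from the same weights
momentum_update(q, k, m=0.9)       # k stays a smoothed copy of q
```

Because the key encoder changes slowly, the embeddings it produces for negative examples remain consistent across training steps, which stabilizes the contrastive objective.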
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.