Personalized Speech Enhancement through Self-Supervised Data
Augmentation and Purification
- URL: http://arxiv.org/abs/2104.02018v1
- Date: Mon, 5 Apr 2021 17:17:55 GMT
- Title: Personalized Speech Enhancement through Self-Supervised Data
Augmentation and Purification
- Authors: Aswin Sivaraman, Sunwoo Kim, Minje Kim
- Abstract summary: We train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources.
We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data.
- Score: 24.596224536399326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training personalized speech enhancement models is innately a no-shot
learning problem due to privacy constraints and limited access to noise-free
speech from the target user. If there is an abundance of unlabeled noisy speech
from the test-time user, a personalized speech enhancement model can be trained
using self-supervised learning. One straightforward approach to model
personalization is to use the target speaker's noisy recordings as
pseudo-sources. Then, a pseudo denoising model learns to remove injected
training noises and recover the pseudo-sources. However, this approach is
volatile as it depends on the quality of the pseudo-sources, which may be too
noisy. As a remedy, we propose an improvement to the self-supervised approach
through data purification. We first train an SNR predictor model to estimate
the frame-by-frame SNR of the pseudo-sources. Then, the predictor's estimates
are converted into weights which adjust the frame-by-frame contribution of the
pseudo-sources towards training the personalized model. We empirically show
that the proposed data purification step improves the usability of the
speaker-specific noisy data in the context of personalized speech enhancement.
Without relying on any clean speech recordings or speaker embeddings, our
approach may be seen as privacy-preserving.
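The purification idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the SNR estimates are converted into weights, so the logistic mapping in `snr_to_weights` (with its `midpoint_db` and `slope` parameters), the mean-squared-error loss, and all function names are assumptions made for illustration. The per-frame SNR values are taken as given, standing in for the output of the trained SNR predictor.

```python
import numpy as np


def mix_at_snr(source, noise, snr_db):
    """Scale `noise` so the source-to-noise power ratio equals `snr_db`, then add it."""
    source_power = np.mean(source ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = source_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return source + scaled_noise, scaled_noise


def snr_to_weights(frame_snr_db, midpoint_db=0.0, slope=0.5):
    """Assumed mapping: logistic squashing of per-frame SNR (dB) into (0, 1)."""
    frame_snr_db = np.asarray(frame_snr_db, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (frame_snr_db - midpoint_db)))


def weighted_frame_loss(estimate, pseudo_source, weights):
    """Weighted mean of per-frame MSE.

    `estimate` and `pseudo_source` have shape (frames, features);
    `weights` has shape (frames,).
    """
    per_frame_mse = np.mean((estimate - pseudo_source) ** 2, axis=1)
    return np.sum(weights * per_frame_mse) / np.sum(weights)


# Toy usage: four frames of a noisy pseudo-source with injected training noise.
rng = np.random.default_rng(0)
pseudo_source = rng.standard_normal((4, 8))
training_noise = rng.standard_normal((4, 8))
mixture, _ = mix_at_snr(pseudo_source, training_noise, snr_db=5.0)

# Stand-in for the SNR predictor's frame-by-frame estimates (dB).
predicted_snr_db = np.array([-10.0, 0.0, 10.0, 20.0])
weights = snr_to_weights(predicted_snr_db)

# Using the mixture as a placeholder for the denoiser's output.
loss = weighted_frame_loss(mixture, pseudo_source, weights)
```

Frames whose pseudo-source is predicted to be noisy receive weights near zero and contribute little to the loss, while cleaner frames dominate the training signal.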
Related papers
- Self-Supervised Speech Quality Estimation and Enhancement Using Only
Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE).
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
(2024-02-26T06:01:38Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech
Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on several recent LLMs show that our approach achieves up to a 53.9% correction improvement in terms of word error rate.
(2024-01-19T01:29:27Z)
- Self-supervised Pretraining for Robust Personalized Voice Activity
Detection in Adverse Conditions [0.0]
We pretrain a long short-term memory (LSTM)-encoder using the autoregressive predictive coding framework.
We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD.
Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models which are more robust to adverse conditions.
(2023-12-27T15:36:17Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement
Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
(2023-09-17T13:27:11Z)
- Adversarial Representation Learning for Robust Privacy Preservation in
Audio [11.409577482625053]
Sound event detection systems may inadvertently reveal sensitive information about users or their surroundings.
We propose a novel adversarial training method for learning representations of audio recordings.
The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method.
(2023-04-29T08:39:55Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
(2023-02-03T01:36:38Z)
- An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling
to Differential Privacy Preserving Speech Recognition [51.20130423303659]
We propose an ensemble learning framework with Poisson sub-sampling to train a collection of teacher models that provide a differential privacy (DP) guarantee for the training data.
Through boosting under DP, a student model derived from the training data suffers little degradation relative to models trained with no privacy protection.
Our proposed solution leverages two mechanisms, namely: (i) a privacy budget amplification via Poisson sub-sampling to train a target prediction model that requires less noise to achieve the same level of privacy budget, and (ii) a combination of the sub-sampling technique and an ensemble teacher-student learning framework.
(2022-10-12T16:34:08Z)
- An Experimental Study on Private Aggregation of Teacher Ensemble
Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue for safeguarding user information used to train deep models, by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform a first experimental study on ASR to avoid acoustic data leakage.
(2022-10-11T16:55:54Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
(2021-10-28T20:39:02Z)
- Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot
Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to use no clean speech targets from the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
(2021-05-08T00:42:03Z)
- Self-Supervised Learning for Personalized Speech Enhancement [25.05285328404576]
Speech enhancement systems can show improved performance by adapting the model towards a single test-time speaker.
The test-time user might only provide a small amount of noise-free speech data, likely insufficient for traditional fully-supervised learning.
We propose self-supervised methods that are designed specifically to learn personalized and discriminative features from abundant in-the-wild noisy, but still personal, speech recordings.
(2021-04-05T17:12:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.