Improving Distortion Robustness of Self-supervised Speech Processing
Tasks with Domain Adaptation
- URL: http://arxiv.org/abs/2203.16104v1
- Date: Wed, 30 Mar 2022 07:25:52 GMT
- Title: Improving Distortion Robustness of Self-supervised Speech Processing
Tasks with Domain Adaptation
- Authors: Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee
- Abstract summary: Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervised learning.
It is therefore important to enhance the robustness of speech processing models so that they perform well when encountering speech distortions.
- Score: 60.26511271597065
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech distortions are a long-standing problem that degrades the
performance of speech processing models trained with supervised learning. It is
therefore important to enhance the robustness of speech processing models so
that they perform well on distorted speech without hurting their original
performance on clean speech. In this work, we propose to improve the robustness
of speech processing models with domain adversarial training (DAT). We
conducted experiments based on the SUPERB framework on five different speech
processing tasks. Since the distortion types of speech data are not always
known, we analyzed both a binary-domain setting, which treats all distorted
speech as a single domain, and a multi-domain setting, which views different
distortions as different domains. In contrast to supervised training methods,
we obtained promising results on target domains where the speech is corrupted
by various distortions, including new, unseen distortions introduced at test
time.
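Domain adversarial training of the kind described in the abstract is commonly realized with a gradient reversal layer: a shared feature extractor feeds both a task head and a domain classifier, and the domain loss gradient is sign-flipped before reaching the shared parameters, so the features stay useful for the task while becoming uninformative about the domain. The following is a minimal toy sketch of that mechanism, not the paper's implementation; the scalar linear model, parameter names, and hand-derived gradients are all illustrative assumptions.

```python
# Toy sketch of domain adversarial training (DAT) via gradient reversal.
# Shared feature: z = w * x; task head: y = a * z; domain head: d = b * z.
# Each head is trained normally on its own squared loss, but the domain
# loss gradient is sign-flipped (scaled by -lam) before it reaches the
# shared weight w, pushing w toward domain-invariant features.

def dat_step(params, x, y_true, d_true, lr=0.1, lam=0.5):
    w, a, b = params
    z = w * x                         # shared feature
    y = a * z                         # task prediction
    d = b * z                         # domain prediction
    task_loss = 0.5 * (y - y_true) ** 2

    # Heads descend their own losses as usual.
    grad_a = (y - y_true) * z
    grad_b = (d - d_true) * z

    # Shared weight: task gradient flows normally; the domain gradient
    # is reversed -- this is the gradient reversal layer in effect.
    grad_w = (y - y_true) * a * x - lam * (d - d_true) * b * x

    new_params = (w - lr * grad_w, a - lr * grad_a, b - lr * grad_b)
    return new_params, task_loss

params = (0.5, 0.5, 0.5)
first_loss = None
for _ in range(50):
    params, task_loss = dat_step(params, x=1.0, y_true=1.0, d_true=0.0)
    if first_loss is None:
        first_loss = task_loss
# task_loss now holds the latest task loss; it shrinks over the run even
# though the shared feature is simultaneously trained to confuse the
# domain head.
```

In a real system the scalars would be network layers and the reversal would be implemented as an autograd operation (identity forward, negated gradient backward); the binary-domain setting in the abstract corresponds to a two-class domain head, the multi-domain setting to one class per distortion type.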
Related papers
- Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge [19.810337081901178]
Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals.
This mismatch between synthetic training data and real recordings can result in poor performance when the test domain significantly differs from the synthetic training domain.
The UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain.
arXiv Detail & Related papers (2024-02-02T13:45:42Z) - Diffusion-based speech enhancement with a weighted generative-supervised
learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE).
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
arXiv Detail & Related papers (2023-09-19T09:13:35Z) - Continuous Modeling of the Denoising Process for Speech Enhancement
Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
arXiv Detail & Related papers (2023-09-17T13:27:11Z) - Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z) - Unsupervised Noise adaptation using Data Simulation [21.866522173387715]
We propose a generative adversarial network based method to efficiently learn a converse clean-to-noisy transformation.
Experimental results show that our method effectively mitigates the domain mismatch between training and test sets.
arXiv Detail & Related papers (2023-02-23T12:57:20Z) - Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z) - SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z) - On monoaural speech enhancement for automatic recognition of real noisy
speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial to improve quality of separated speech from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z) - WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.