MetricGAN-U: Unsupervised speech enhancement/dereverberation based only
on noisy/reverberated speech
- URL: http://arxiv.org/abs/2110.05866v1
- Date: Tue, 12 Oct 2021 10:01:32 GMT
- Title: MetricGAN-U: Unsupervised speech enhancement/dereverberation based only
on noisy/reverberated speech
- Authors: Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao
- Abstract summary: We propose MetricGAN-U, which releases the constraint from conventional unsupervised learning.
In MetricGAN-U, only noisy speech is required to train the model by optimizing non-intrusive speech quality metrics.
The experimental results verified that MetricGAN-U outperforms baselines in both objective and subjective metrics.
- Score: 28.012465936987013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most of the deep learning-based speech enhancement models are learned in a
supervised manner, which implies that pairs of noisy and clean speech are
required during training. Consequently, many noisy speech recordings made in
daily life cannot be used to train the model. Although certain unsupervised
learning frameworks have also been proposed to solve the pair constraint, they
still require clean speech or noise for training. Therefore, in this paper, we
propose MetricGAN-U, which stands for MetricGAN-unsupervised, to further
release the constraint from conventional unsupervised learning. In MetricGAN-U,
only noisy speech is required to train the model by optimizing non-intrusive
speech quality metrics. The experimental results verified that MetricGAN-U
outperforms baselines in both objective and subjective metrics.
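As a rough illustration of the idea in the abstract, the sketch below uses hypothetical toy models and a toy metric, not the paper's networks: a discriminator D is regressed onto the score of a non-intrusive quality metric Q, which needs only the enhanced signal (no clean reference), and the generator is then trained to make D predict the best attainable score.

```python
import numpy as np

def metric_q(x):
    """Toy stand-in for a non-intrusive quality metric (DNSMOS-like):
    it scores a signal without any clean reference. Here it simply
    rewards unit signal energy, purely for illustration."""
    return float(np.exp(-abs(np.mean(x ** 2) - 1.0)))

def d_loss(d_score, q_score):
    """Discriminator objective: regress the metric's score (squared error)."""
    return (d_score - q_score) ** 2

def g_loss(d_score_on_enhanced, best=1.0):
    """Generator objective: push D's prediction of the enhanced output
    toward the metric's best attainable score."""
    return (d_score_on_enhanced - best) ** 2

rng = np.random.default_rng(0)
noisy = rng.normal(scale=2.0, size=16000)  # only unpaired noisy speech is available
enhanced = 0.5 * noisy                     # output of some enhancement model

q = metric_q(enhanced)  # the training label comes from the metric alone
```

Because Q never sees clean speech, this training signal is available for any noisy recording, which is the constraint MetricGAN-U removes.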
Related papers
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets
from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Pushing the Limits of Unsupervised Unit Discovery for SSL Speech
Representation [12.506633315768832]
HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task.
We present an unsupervised method to improve SSL targets.
Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training.
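The offline-clustering step this entry refers to can be sketched as follows (shapes and k are hypothetical; HuBERT itself runs k-means over MFCCs or hidden features): each speech frame is assigned its nearest centroid index, and those indices serve as the discrete pseudo-labels for masked prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 39))   # 100 speech frames, 39-dim features (hypothetical)
centroids = rng.normal(size=(50, 39))   # k=50 centers from a prior offline k-means run

# Assign each frame to its nearest centroid; the resulting indices are the
# discrete units used as targets in the masked-prediction pretext task.
dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
units = dists.argmin(axis=1)            # one discrete unit id per frame
```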
arXiv Detail & Related papers (2023-06-15T07:45:12Z)
- Supervised Acoustic Embeddings And Their Transferability Across
Languages [2.28438857884398]
In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
arXiv Detail & Related papers (2023-01-03T09:37:24Z)
- A Training and Inference Strategy Using Noisy and Enhanced Speech as
Target for Speech Enhancement without Clean Speech [24.036987059698415]
We propose a training and inference strategy that additionally uses enhanced speech as a target.
Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing.
Experimental results show that our proposed method outperforms several baselines.
arXiv Detail & Related papers (2022-10-27T12:26:24Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot
Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
arXiv Detail & Related papers (2021-05-08T00:42:03Z)
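The distillation objective of the last entry can be sketched as below (the MSE pseudo-target loss and the scalar stand-in models are assumptions; the paper's exact loss and architectures may differ): with no clean speech from the test speaker, the large teacher's denoised output replaces the missing clean target for the compact student.

```python
import numpy as np

def distill_loss(student_out, teacher_out):
    """Test-time adaptation without clean speech: the teacher's enhanced
    signal acts as the pseudo-target for the student (MSE, an assumption)."""
    return float(np.mean((student_out - teacher_out) ** 2))

rng = np.random.default_rng(0)
noisy = rng.normal(size=8000)        # test speaker's noisy recording
teacher_enhanced = 0.8 * noisy       # stand-in for the large teacher's output
student_enhanced = 0.7 * noisy       # stand-in for the compact student's output

loss = distill_loss(student_enhanced, teacher_enhanced)
```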
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.