Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech
- URL: http://arxiv.org/abs/2402.16321v1
- Date: Mon, 26 Feb 2024 06:01:38 GMT
- Title: Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech
- Authors: Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang
- Abstract summary: We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variational autoencoder (VQ-VAE).
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
- Score: 50.95292368372455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech quality estimation has recently undergone a paradigm shift from
human-hearing expert designs to machine-learning models. However, current
models rely mainly on supervised learning, which is time-consuming and
expensive for label collection. To solve this problem, we propose VQScore, a
self-supervised metric for evaluating speech based on the quantization error of
a vector-quantized-variational autoencoder (VQ-VAE). The training of VQ-VAE
relies on clean speech; hence, large quantization errors can be expected when
the speech is distorted. To further improve correlation with real quality
scores, domain knowledge of speech processing is incorporated into the model
design. We found that the vector quantization mechanism could also be used for
self-supervised speech enhancement (SE) model training. To improve the
robustness of the encoder for SE, a novel self-distillation mechanism combined
with adversarial training is introduced. In summary, the proposed speech
quality estimation method and enhancement models require only clean speech for
training without any label requirements. Experimental results show that the
proposed VQScore and enhancement model are competitive with supervised
baselines. The code will be released after publication.
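As a rough, illustrative sketch of the core idea (not the paper's exact definition of VQScore), the snippet below scores an utterance by how closely a codebook trained only on clean speech matches its encoder outputs; the frame layout, the nearest-codeword lookup, and the cosine-similarity aggregation are assumptions made here for illustration.
```python
import numpy as np

def vq_quality_score(z, codebook):
    """Score encoder frames by how well a clean-speech codebook explains them.

    z        : (T, D) array of encoder outputs, one row per frame.
    codebook : (K, D) array of codewords learned from clean speech only.

    Returns the mean cosine similarity between each frame and its nearest
    codeword; distorted speech tends to fall far from the codebook, so the
    quantization error grows and the score drops.
    """
    # Squared Euclidean distance from every frame to every codeword: (T, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = codebook[d2.argmin(axis=1)]  # (T, D) assigned codewords

    # Cosine similarity between each frame and its assigned codeword.
    num = (z * nearest).sum(-1)
    den = np.linalg.norm(z, axis=-1) * np.linalg.norm(nearest, axis=-1) + 1e-8
    return float((num / den).mean())
```
Read this way, the abstract's enhancement result follows the same intuition: an encoder trained so that noisy inputs still land near clean-speech codewords is implicitly denoising its input.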
Related papers
- RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models [14.07649230604283]
We propose low complexity changes to the quantization aware training (QAT) process to improve model accuracy.
With the improved accuracy, it opens up the possibility of exploiting some of the other benefits of noise-based QAT.
arXiv Detail & Related papers (2023-05-24T19:45:56Z)
- A vector quantized masked autoencoder for audiovisual speech emotion recognition [5.8641712963450825]
The paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning.
A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence.
Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods.
arXiv Detail & Related papers (2023-05-05T14:19:46Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization [13.075574481614478]
One noted issue of the vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook.
We propose a new training scheme that extends the standard VAE via novel dequantization and quantization.
Our experiments show that SQ-VAE improves codebook utilization without using common heuristics.
arXiv Detail & Related papers (2022-05-16T09:49:37Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word-error rates similar to those of previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Knowledge Distillation for Quality Estimation [79.51452598302934]
Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations.
Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results.
We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations while using 8x fewer parameters.
arXiv Detail & Related papers (2021-07-01T12:36:21Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge [26.114011076658237]
We propose two neural models to tackle the problem of learning discrete representations of speech.
The first model is a type of vector-quantized variational autoencoder (VQ-VAE).
The second model combines vector quantization with contrastive predictive coding (VQ-CPC); both rely on the VQ bottleneck sketched after this list.
We evaluate the models on English and Indonesian data for the ZeroSpeech 2020 challenge.
arXiv Detail & Related papers (2020-05-19T13:06:17Z)
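The two ZeroSpeech models above differ in encoder and training objective, but both route encoder frames through a vector-quantization bottleneck. The PyTorch module below is a minimal, generic sketch of that bottleneck (nearest-codeword lookup, the usual codebook and commitment losses, and a straight-through gradient); the codebook size, dimensionality, and loss weight are placeholder assumptions, not the papers' settings.
```python
import torch
import torch.nn.functional as F

class VectorQuantizer(torch.nn.Module):
    """Minimal VQ bottleneck: snap each encoder frame to its nearest codeword
    and pass gradients straight through, as used in VQ-VAE- and VQ-CPC-style
    models (hyperparameters here are illustrative placeholders)."""

    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = torch.nn.Parameter(0.1 * torch.randn(num_codes, dim))
        self.beta = beta  # commitment-loss weight

    def forward(self, z):  # z: (T, D) encoder output
        d2 = torch.cdist(z, self.codebook) ** 2  # (T, K) squared distances
        codes = d2.argmin(dim=1)                 # nearest codeword per frame
        z_q = self.codebook[codes]               # (T, D) quantized frames

        # Standard VQ-VAE objective terms: move codewords toward encoder
        # outputs, and commit the encoder to its chosen codewords.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: forward pass uses z_q, gradients flow to z.
        z_q = z + (z_q - z).detach()
        return z_q, codes, vq_loss
```
In a VQ-VAE the returned `z_q` would feed a decoder trained on reconstruction, while a VQ-CPC model would instead add a contrastive prediction loss over the code sequence; either way `vq_loss` is added to that model-specific objective.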