Perceive and predict: self-supervised speech representation based loss
functions for speech enhancement
- URL: http://arxiv.org/abs/2301.04388v3
- Date: Mon, 26 Jun 2023 09:31:53 GMT
- Title: Perceive and predict: self-supervised speech representation based loss
functions for speech enhancement
- Authors: George Close, William Ravenscroft, Thomas Hain and Stefan Goetze
- Abstract summary: It is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility.
Experiments using this distance as a loss function are performed, demonstrating improved performance over the use of an STFT spectrogram distance based loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work in the domain of speech enhancement has explored the use of
self-supervised speech representations to aid in the training of neural speech
enhancement models. However, much of this work focuses on using the deepest or
final outputs of self-supervised speech representation models, rather than the
earlier feature encodings. The use of self-supervised representations in such a
way is often not fully motivated. In this work it is shown that the distance
between the feature encodings of clean and noisy speech correlates strongly with
psychoacoustically motivated measures of speech quality and intelligibility, as
well as with human Mean Opinion Score (MOS) ratings. Experiments using this
distance as a loss function are performed, and improved performance over the use
of an STFT spectrogram distance based loss, as well as other common loss functions
from the speech enhancement literature, is demonstrated using objective measures
such as perceptual evaluation of speech quality (PESQ) and short-time objective
intelligibility (STOI).
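The core idea, using the distance between self-supervised feature encodings of clean and estimated speech as a training loss in place of an STFT spectrogram distance, can be sketched as follows. This is a minimal illustration, assuming torchaudio's wav2vec 2.0 pipeline as the representation model and an L1 distance; the paper's exact choice of self-supervised model, encoder layer and distance may differ.

```python
import torch
import torchaudio

# Frozen self-supervised model; its convolutional feature extractor
# (not the final transformer layers) supplies the encodings.
bundle = torchaudio.pipelines.WAV2VEC2_BASE   # expects 16 kHz input
ssl_model = bundle.get_model().eval()
for p in ssl_model.parameters():
    p.requires_grad_(False)

def feature_encoding_loss(estimate: torch.Tensor,
                          clean: torch.Tensor) -> torch.Tensor:
    """L1 distance between the feature-encoder outputs of the enhanced
    estimate and the clean reference; both inputs are (batch, samples)."""
    feats_est, _ = ssl_model.feature_extractor(estimate, None)
    feats_cln, _ = ssl_model.feature_extractor(clean, None)
    return torch.nn.functional.l1_loss(feats_est, feats_cln)

# Typical use in a training step, where `enhancer` is any SE network
# mapping a noisy waveform to an estimate of the clean one:
# loss = feature_encoding_loss(enhancer(noisy), clean)
# loss.backward()
```

Freezing the representation model means gradients flow only into the enhancement network, which is how perceptual losses of this kind are typically applied.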
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions
This work looks at the relationship between the language of the audio used to train the self-supervised representation and that used to train the SE system.
Enhancement models trained using a loss function which incorporates a self-supervised representation whose training language exactly matches that of the noisy data used to train the SE system perform better than those without an exact match.
Otherwise, the training language of the self-supervised representation appears to have only a minor effect on enhancement performance.
arXiv Detail & Related papers (2023-07-27T09:20:38Z)
- On the Behavior of Intrusive and Non-intrusive Speech Enhancement Metrics in Predictive and Generative Settings
We evaluate the performance of the same speech enhancement backbone trained under predictive and generative paradigms.
We show that intrusive and non-intrusive measures correlate differently for each paradigm; a brief example of computing two common intrusive measures follows this entry.
arXiv Detail & Related papers (2023-06-05T16:30:17Z)
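As a concrete reference for the intrusive measures discussed in the entry above, the sketch below computes two common ones, PESQ and STOI, with the pesq and pystoi packages; the evaluation setups used in these papers may differ. Both measures are intrusive in that they compare the enhanced signal against its clean reference.

```python
import numpy as np
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

fs = 16000                           # sample rate in Hz
clean = np.random.randn(fs * 3)      # placeholder 3 s reference signal
enhanced = clean + 0.05 * np.random.randn(fs * 3)  # placeholder estimate

pesq_score = pesq(fs, clean, enhanced, 'wb')           # wideband PESQ
stoi_score = stoi(clean, enhanced, fs, extended=False)
print(f"PESQ: {pesq_score:.2f}  STOI: {stoi_score:.3f}")
```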
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation
Speech distortions are a long-standing problem that degrades the performance of speech processing models trained in a supervised manner.
It is therefore important to improve the robustness of speech processing models so that they maintain good performance when encountering speech distortions.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement
We present a fully convolutional audio-visual (AV) speech enhancement (SE) model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed intelligibility-oriented (I-O) AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions; a simplified sketch of such an intelligibility-oriented objective follows this entry.
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
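Following the intelligibility-oriented training idea in the entry above, here is a heavily simplified, differentiable STOI-style objective: one minus the mean correlation between short-time spectral envelope segments of enhanced and clean speech. This is a sketch of the general idea only, not the paper's modified STOI, and all parameter values are illustrative.

```python
import torch

def stoi_style_loss(enhanced: torch.Tensor, clean: torch.Tensor,
                    n_fft: int = 512, seg_frames: int = 30) -> torch.Tensor:
    """STOI-inspired loss on (batch, samples) waveforms: high envelope
    correlation between enhanced and clean speech gives low loss."""
    window = torch.hann_window(n_fft, device=clean.device)

    def envelopes(x):
        spec = torch.stft(x, n_fft, hop_length=n_fft // 2,
                          window=window, return_complex=True)
        return spec.abs()                      # (batch, freq, frames)

    e, c = envelopes(enhanced), envelopes(clean)
    # split the time axis into short segments of seg_frames frames
    n_seg = e.shape[-1] // seg_frames
    e = e[..., :n_seg * seg_frames].reshape(*e.shape[:-1], n_seg, seg_frames)
    c = c[..., :n_seg * seg_frames].reshape(*c.shape[:-1], n_seg, seg_frames)

    def norm(x):
        # zero-mean, unit-norm each segment so the dot product below
        # becomes a correlation coefficient
        x = x - x.mean(dim=-1, keepdim=True)
        return x / (x.norm(dim=-1, keepdim=True) + 1e-8)

    corr = (norm(e) * norm(c)).sum(dim=-1)     # per-segment correlation
    return 1.0 - corr.mean()
```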
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate that the proposed method learns effective disentangled speech representations; a minimal sketch of the VQ mechanism follows this entry.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
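For the vector quantization component mentioned in the VQMIVC entry above, a minimal sketch of codebook lookup with a straight-through gradient estimator follows; the codebook size and feature dimension are illustrative, and VQMIVC's mutual-information losses are omitted.

```python
import torch

class VectorQuantizer(torch.nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through
    estimator so gradients pass back to the continuous encoder output."""
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous content features
        # squared distance to every codebook vector: (batch, frames, codes)
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dist.argmin(dim=-1)            # nearest code per frame
        q = self.codebook(idx)               # quantized features
        # straight-through: forward pass uses q, backward treats q as z
        # (commitment / codebook losses are omitted in this sketch)
        return z + (q - z).detach(), idx
```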
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance; a minimal sketch of this conditioning follows this entry.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
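Finally, the self-adaptation idea in the entry above, conditioning the enhancer on a speaker representation pooled from the test utterance itself, might look like the hypothetical sketch below; module names and dimensions are invented for illustration.

```python
import torch

class SpeakerConditionedSE(torch.nn.Module):
    """Hypothetical SE network conditioned on an utterance-level
    speaker embedding extracted from the (noisy) input itself."""
    def __init__(self, feat_dim: int = 257, spk_dim: int = 128):
        super().__init__()
        self.spk_encoder = torch.nn.GRU(feat_dim, spk_dim, batch_first=True)
        self.enhancer = torch.nn.LSTM(feat_dim + spk_dim, feat_dim,
                                      batch_first=True)

    def forward(self, noisy_feats: torch.Tensor) -> torch.Tensor:
        # noisy_feats: (batch, frames, feat_dim) spectral features
        _, h = self.spk_encoder(noisy_feats)
        spk = h[-1]                                  # (batch, spk_dim)
        # broadcast the utterance-level embedding to every frame
        spk = spk.unsqueeze(1).expand(-1, noisy_feats.size(1), -1)
        mask, _ = self.enhancer(torch.cat([noisy_feats, spk], dim=-1))
        return torch.sigmoid(mask) * noisy_feats     # masked features
```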
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.