The Effect of Spoken Language on Speech Enhancement using
Self-Supervised Speech Representation Loss Functions
- URL: http://arxiv.org/abs/2307.14502v2
- Date: Fri, 20 Oct 2023 08:55:17 GMT
- Title: The Effect of Spoken Language on Speech Enhancement using
Self-Supervised Speech Representation Loss Functions
- Authors: George Close, Thomas Hain and Stefan Goetze
- Abstract summary: This work looks at the relationship between the language of the audio used to train self-supervised representation and that used to train the SE system.
Enhancement models trained using a loss function which incorporates a self-supervised representation whose training language exactly matches that of the noisy data used to train the SE system show better performance than those where the languages do not match.
It is found that the training language of the self-supervised representation appears to have a minor effect on enhancement performance.
- Score: 21.237026538221404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work in the field of speech enhancement (SE) has involved the use of
self-supervised speech representations (SSSRs) as feature transformations in
loss functions. However, in prior work, very little attention has been paid to
the relationship between the language of the audio used to train the
self-supervised representation and that used to train the SE system.
Enhancement models trained using a loss function which incorporates a
self-supervised representation whose training language exactly matches that of
the noisy data used to train the SE system show better performance than those
where the languages do not match. This may lead to enhancement systems which are language
specific and as such do not generalise well to unseen languages, unlike models
trained using traditional spectrogram or time domain loss functions. In this
work, SE models are trained and tested on a number of different languages, with
self-supervised representations which themselves are trained using different
language combinations and with differing network structures as loss function
representations. These models are then tested across unseen languages and their
performances are analysed. It is found that the training language of the
self-supervised representation appears to have only a minor effect on
enhancement performance; however, the amount of training data available in a
particular language greatly affects performance.
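The loss described in the abstract, a distance in the feature space of a self-supervised representation, optionally blended with a conventional signal-domain loss, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a fixed random linear projection stands in for a real SSSR encoder (e.g. HuBERT or wav2vec 2.0 features), NumPy replaces a differentiable training framework, and the blending weight `alpha` is hypothetical.

```python
import numpy as np

def sssr_feature_loss(enhanced, clean, encoder):
    """Mean absolute distance between SSSR encodings of the two signals."""
    return float(np.mean(np.abs(encoder(enhanced) - encoder(clean))))

def combined_loss(enhanced, clean, encoder, alpha=0.5):
    """Blend a waveform-domain MSE with the representation-space loss."""
    wav_loss = float(np.mean((enhanced - clean) ** 2))
    feat_loss = sssr_feature_loss(enhanced, clean, encoder)
    return alpha * wav_loss + (1.0 - alpha) * feat_loss

# Toy stand-in encoder: a fixed random projection of a 1-second, 16 kHz
# waveform to a 64-dim feature vector. A real system would use the outputs
# of a pretrained self-supervised model here.
rng = np.random.default_rng(0)
W = rng.standard_normal((16000, 64)) / np.sqrt(16000.0)
encoder = lambda x: x @ W

clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)

combined_loss(clean, clean, encoder)  # → 0.0 (identical signals)
loss = combined_loss(noisy, clean, encoder)  # positive for a degraded signal
```

The language-mismatch question studied in the paper enters through the choice of `encoder`: swapping in representations pretrained on different languages (or language mixtures) changes only the feature space in which `sssr_feature_loss` is measured, leaving the rest of the training recipe fixed.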
Related papers
- Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian [0.19116784879310028]
This work studies how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance.
We evaluate the effectiveness of two vocabulary adaptation approaches -- retraining the tokenizer and pruning unused tokens.
arXiv Detail & Related papers (2025-01-05T19:21:45Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Exploring the Benefits of Tokenization of Discrete Acoustic Units [4.591279524925446]
Tokenization algorithms merge the units of a base vocabulary into larger, variable-rate units.
We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed.
arXiv Detail & Related papers (2024-06-08T18:34:28Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on discretized visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Perceive and predict: self-supervised speech representation based loss functions for speech enhancement [23.974815078687445]
It is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility.
Experiments using this distance as a loss function show improved performance over STFT spectrogram distance based losses.
arXiv Detail & Related papers (2023-01-11T10:20:56Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are transcribed.
We show that this augmentation can improve the model's performance even on inter-sentential language switches not seen during training, by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN improves speed significantly, by 39%, and adaptively uses linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.