Supervised Acoustic Embeddings And Their Transferability Across
Languages
- URL: http://arxiv.org/abs/2301.01020v1
- Date: Tue, 3 Jan 2023 09:37:24 GMT
- Title: Supervised Acoustic Embeddings And Their Transferability Across
Languages
- Authors: Sreepratha Ram and Hanan Aldarmaki
- Abstract summary: In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
- Score: 2.28438857884398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In speech recognition, it is essential to model the phonetic content of the
input signal while discarding irrelevant factors such as speaker variations and
noise, which is challenging in low-resource settings. Self-supervised
pre-training has been proposed as a way to improve both supervised and
unsupervised speech recognition, including frame-level feature representations
and Acoustic Word Embeddings (AWE) for variable-length segments. However,
self-supervised models alone cannot learn perfect separation of the linguistic
content as they are trained to optimize indirect objectives. In this work, we
experiment with different pre-trained self-supervised features as input to AWE
models and show that they work best within a supervised framework. Models
trained on English can be transferred to other languages with no adaptation and
outperform self-supervised models trained solely on the target languages.
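The pipeline the abstract describes — frame-level self-supervised features pooled into a fixed-length Acoustic Word Embedding and compared across word segments — can be sketched as follows. This is a minimal illustration only: it uses random arrays in place of real wav2vec 2.0/HuBERT features and simple mean pooling in place of the paper's trained supervised encoder, and it scores the standard same-different word discrimination setup with cosine similarity. All names and sizes here are hypothetical.

```python
import numpy as np

def mean_pool_embedding(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, feat_dim) matrix of frame-level features
    into one fixed-length acoustic word embedding. A trained AWE model
    would use a learned (e.g. recurrent) encoder here; mean pooling is
    a common untrained baseline."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)  # unit-normalize for cosine scoring

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))  # inputs are already unit-normalized

# Simulated frame-level features for three variable-length word segments
# (stand-ins for self-supervised features such as wav2vec 2.0 outputs).
rng = np.random.default_rng(0)
word_a1 = rng.normal(size=(42, 768))                        # word A, speaker 1
word_a2 = word_a1[::2] + 0.1 * rng.normal(size=(21, 768))   # word A, speaker 2 (shorter, perturbed)
word_b = rng.normal(size=(35, 768))                         # word B, speaker 1

e_a1, e_a2, e_b = map(mean_pool_embedding, (word_a1, word_a2, word_b))

# Same-different evaluation: embeddings of the same word should be
# closer than embeddings of different words.
same = cosine_similarity(e_a1, e_a2)
diff = cosine_similarity(e_a1, e_b)
assert same > diff
```

Note that the embeddings have a fixed dimension regardless of segment length, which is what makes variable-length word segments directly comparable.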
Related papers
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets
from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised
Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- MetricGAN-U: Unsupervised speech enhancement/ dereverberation based only
on noisy/ reverberated speech [28.012465936987013]
We propose MetricGAN-U, which releases the constraint from conventional unsupervised learning.
In MetricGAN-U, only noisy speech is required to train the model by optimizing non-intrusive speech quality metrics.
Experimental results verify that MetricGAN-U outperforms baselines in both objective and subjective metrics.
arXiv Detail & Related papers (2021-10-12T10:01:32Z)
- Injecting Text and Cross-lingual Supervision in Few-shot Learning from
Self-Supervised Models [33.66135770490531]
We show how universal phoneset acoustic models can leverage cross-lingual supervision to improve transfer of self-supervised representations to new languages.
We also show how target-language text can be used to enable and improve fine-tuning with the lattice-free maximum mutual information objective.
arXiv Detail & Related papers (2021-10-10T17:33:44Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
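As an aside, the vector-quantization content encoding mentioned in the VQMIVC entry above can be sketched as follows. This is a generic illustration of VQ with a random codebook, not the VQMIVC implementation; the codebook size, feature dimension, and function name are illustrative assumptions.

```python
import numpy as np

def vector_quantize(frames: np.ndarray, codebook: np.ndarray):
    """Map each frame-level feature to its nearest codebook entry.
    Quantization discards fine-grained detail (e.g. speaker-specific
    variation), keeping a discrete approximation of the content."""
    # Pairwise squared Euclidean distances, shape (num_frames, num_codes).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)  # discrete content indices
    quantized = codebook[codes]   # continuous embeddings of those codes
    return codes, quantized

rng = np.random.default_rng(1)
codebook = rng.normal(size=(64, 16))  # 64 codes of dimension 16 (illustrative)
frames = rng.normal(size=(100, 16))   # simulated content-encoder outputs

codes, quantized = vector_quantize(frames, codebook)
assert codes.shape == (100,) and quantized.shape == frames.shape
```

In a trained system the codebook is learned jointly with the encoder; the discrete bottleneck is what encourages the content/speaker separation these papers exploit.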
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.