SPADE: Self-supervised Pretraining for Acoustic DisEntanglement
- URL: http://arxiv.org/abs/2302.01483v1
- Date: Fri, 3 Feb 2023 01:36:38 GMT
- Title: SPADE: Self-supervised Pretraining for Acoustic DisEntanglement
- Authors: John Harvill, Jarred Barber, Arun Nair, Ramin Pishehvar
- Abstract summary: We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
- Score: 2.294014185517203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised representation learning approaches have grown in popularity
due to the ability to train models on large amounts of unlabeled data and have
demonstrated success in diverse fields such as natural language processing,
computer vision, and speech. Previous self-supervised work in the speech domain
has disentangled multiple attributes of speech such as linguistic content,
speaker identity, and rhythm. In this work, we introduce a self-supervised
approach to disentangle room acoustics from speech and use the acoustic
representation on the downstream task of device arbitration. Our results
demonstrate that our proposed approach significantly improves performance over
a baseline when labeled training data is scarce, indicating that our
pretraining scheme learns to encode room acoustic information while remaining
invariant to other attributes of the speech signal.
Related papers
- Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT [10.18337180909434]
Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio.
We propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information.
arXiv Detail & Related papers (2024-09-16T09:07:08Z) - Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z) - Adversarial Representation Learning for Robust Privacy Preservation in
Audio [11.409577482625053]
Sound event detection systems may inadvertently reveal sensitive information about users or their surroundings.
We propose a novel adversarial training method for learning representations of audio recordings.
The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method.
arXiv Detail & Related papers (2023-04-29T08:39:55Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - Supervised Acoustic Embeddings And Their Transferability Across
Languages [2.28438857884398]
In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
arXiv Detail & Related papers (2023-01-03T09:37:24Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z) - Augmentation adversarial training for self-supervised speaker
recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.