Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
- URL: http://arxiv.org/abs/2204.12765v1
- Date: Wed, 27 Apr 2022 08:35:57 GMT
- Title: Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
- Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong
Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei
- Abstract summary: We study which factors lead to the success of self-supervised learning on speaker-related tasks.
Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size.
- Score: 86.53044183309824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, self-supervised learning (SSL) has demonstrated strong performance
in speaker recognition, even when the pre-training objective is designed for
speech recognition. In this paper, we study which factors lead to the success
of self-supervised learning on speaker-related tasks, e.g., speaker verification
(SV), through a series of carefully designed experiments. Our empirical results
on the VoxCeleb1 dataset suggest that the benefit of SSL to the SV task comes from a
combination of the masked speech prediction loss, data scale, and model size, while
the SSL quantizer has a minor impact. We further employ the integrated
gradients attribution method and loss landscape visualization to understand why
self-supervised learning is effective for speaker recognition.
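To make the masked speech prediction objective mentioned in the abstract concrete, here is a minimal PyTorch-style sketch of a HuBERT-like loss: mask a subset of frames, encode the corrupted sequence, and predict discrete pseudo-labels only at the masked positions. The function and argument names and the simplified per-frame masking are illustrative assumptions, not the authors' implementation (real systems mask contiguous spans).

```python
import torch
import torch.nn.functional as F

def masked_speech_prediction_loss(encoder, features, targets, mask_prob=0.08):
    """Sketch of a HuBERT-style masked prediction loss.

    features: (batch, time, dim) frame-level inputs
    targets:  (batch, time) discrete pseudo-labels (e.g. k-means cluster ids)
    encoder:  maps (batch, time, dim) -> (batch, time, num_clusters) logits
    """
    batch, time, _ = features.shape
    # Sample a boolean mask over frames (simplified: independent per frame).
    mask = torch.rand(batch, time, device=features.device) < mask_prob
    corrupted = features.clone()
    corrupted[mask] = 0.0  # stand-in for a learned [MASK] embedding
    logits = encoder(corrupted)
    # Cross-entropy is computed on the masked frames only.
    return F.cross_entropy(logits[mask], targets[mask])
```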
Related papers
- CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition [5.974778743092437]
CochCeps-Augment is a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations (a generic sketch of masking-based contrastive learning follows this entry).
Our results suggest that CochCeps-Augment can serve as a standard tool in speech emotion recognition analysis.
arXiv Detail & Related papers (2024-02-10T11:13:13Z)
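The CochCeps-Augment summary above does not spell out the contrastive objective, so the following is a generic sketch of masking-based contrastive learning: two masked views of the same utterance are embedded and pulled together with an NT-Xent loss. This illustrates the general mechanism only; the specific cochlear cepstrum-based masking of CochCeps-Augment is not reproduced here, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss between two augmented (masked) views.

    z1, z2: (batch, dim) embeddings of two differently-masked views
            of the same utterances.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, dim)
    sim = z @ z.t() / temperature                       # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    batch = z1.size(0)
    # The positive for view i is the other view of the same utterance.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets.to(sim.device))
```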
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that their weaker performance on utterance-level tasks is caused by the lack of disentangled representations and of an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- Audio Self-supervised Learning: A Survey [60.41768569891083]
Self-Supervised Learning (SSL) aims to discover general representations from large-scale data without requiring human annotations.
Its success in computer vision and natural language processing has prompted its recent adoption in audio and speech processing.
arXiv Detail & Related papers (2022-03-02T15:58:29Z)
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information by adding an additional SSL loss on the intermediate layers (a minimal sketch follows this entry).
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
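As a rough illustration of intermediate layer supervision, the sketch below applies the same masked-prediction loss at both an intermediate transformer layer and the final layer. The layer index, loss weight, and module names are assumptions for illustration, not the ILS-SSL configuration; `loss_fn` could be the masked cross-entropy from the first sketch above.

```python
import torch.nn as nn

class IntermediateSupervisionModel(nn.Module):
    """Sketch: add an SSL loss at an intermediate layer so that lower
    layers are pushed toward content (phonetic) information."""

    def __init__(self, blocks, dim, num_clusters, inter_layer=6, inter_weight=1.0):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)          # transformer blocks
        self.final_head = nn.Linear(dim, num_clusters)
        self.inter_head = nn.Linear(dim, num_clusters)
        self.inter_layer = inter_layer               # hypothetical choice
        self.inter_weight = inter_weight             # hypothetical weight

    def forward(self, x, mask, targets, loss_fn):
        inter_loss = 0.0
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.inter_layer:
                # Extra supervision on the intermediate representation.
                inter_loss = loss_fn(self.inter_head(x)[mask], targets[mask])
        final_loss = loss_fn(self.final_head(x)[mask], targets[mask])
        return final_loss + self.inter_weight * inter_loss
```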
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced to enhance unsupervised speaker information extraction.
Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale the training dataset up to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge amounts of unlabeled data has been successfully explored for image and natural language processing.
Recent work has also investigated SSL for speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract the speaker representation used for adaptation directly from the test utterance (a conditioning sketch follows this entry).
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
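To illustrate the self-adaptation idea, the following sketch conditions an enhancement network on a speaker embedding computed from the noisy test utterance itself. The architecture (GRU speaker encoder, LSTM enhancer, mask-based output) and all names are assumptions for illustration; the paper's actual model uses multi-head self-attention.

```python
import torch
import torch.nn as nn

class SelfAdaptiveEnhancer(nn.Module):
    """Sketch: derive a speaker embedding from the test utterance and
    feed it back as an auxiliary conditioning feature."""

    def __init__(self, feat_dim=80, emb_dim=128, hidden=256):
        super().__init__()
        self.spk_encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.enhancer = nn.LSTM(feat_dim + emb_dim, hidden, batch_first=True)
        self.mask_head = nn.Linear(hidden, feat_dim)

    def forward(self, noisy):                       # noisy: (B, T, feat_dim)
        _, h = self.spk_encoder(noisy)              # h: (1, B, emb_dim)
        # Broadcast the utterance-level embedding over all frames.
        spk = h[-1].unsqueeze(1).expand(-1, noisy.size(1), -1)
        x, _ = self.enhancer(torch.cat([noisy, spk], dim=-1))
        # Predict a per-bin mask and apply it to the noisy spectrogram.
        return torch.sigmoid(self.mask_head(x)) * noisy
```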
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.