UniSpeech-SAT: Universal Speech Representation Learning with Speaker
Aware Pre-Training
- URL: http://arxiv.org/abs/2110.05752v1
- Date: Tue, 12 Oct 2021 05:43:30 GMT
- Title: UniSpeech-SAT: Universal Speech Representation Learning with Speaker
Aware Pre-Training
- Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie
Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu
- Abstract summary: Two methods are introduced for enhancing unsupervised speaker information extraction.
Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvements.
- Score: 72.004873454347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) is a long-standing goal for speech processing,
since it utilizes large-scale unlabeled data and avoids extensive human
labeling. Recent years have witnessed great successes in applying self-supervised
learning to speech recognition, while limited exploration has been attempted in
applying SSL to modeling speaker characteristics. In this paper, we aim to
improve the existing SSL framework for speaker representation learning. Two
methods are introduced for enhancing unsupervised speaker information
extraction. First, we apply multi-task learning to the current SSL
framework, where we integrate the utterance-wise contrastive loss with the SSL
objective function. Second, for better speaker discrimination, we propose an
utterance mixing strategy for data augmentation, where additional overlapped
utterances are created without supervision and incorporated during training. We
integrate the proposed methods into the HuBERT framework. Experimental results on
the SUPERB benchmark show that the proposed system achieves state-of-the-art
performance in universal representation learning, especially for speaker
identification oriented tasks. An ablation study is performed to verify the
efficacy of each proposed method. Finally, we scale up the training dataset to 94
thousand hours of public audio data and achieve further performance improvements on
all SUPERB tasks.
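
The first method described in the abstract, multi-task learning with an utterance-wise contrastive loss, can be illustrated with a rough sketch. The snippet below is a minimal, hypothetical PyTorch illustration rather than the paper's implementation: it assumes that mean-pooled halves of each utterance serve as two views of the same speaker, that an InfoNCE-style loss contrasts them against the other utterances in the batch, and that a weight alpha (an assumed hyperparameter) combines this term with the HuBERT-style masked-prediction loss.

```python
# Hypothetical sketch of the multi-task objective: HuBERT-style masked-prediction
# loss plus an utterance-wise contrastive loss. Function names and the alpha
# weight are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def utterance_contrastive_loss(frame_features: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over utterance embeddings.

    frame_features: (batch, time, dim) encoder outputs. Two halves of each
    utterance are mean-pooled into two "views"; views of the same utterance
    are positives, views of other utterances in the batch are negatives.
    """
    batch, time, _ = frame_features.shape
    half = time // 2
    view_a = F.normalize(frame_features[:, :half].mean(dim=1), dim=-1)  # (batch, dim)
    view_b = F.normalize(frame_features[:, half:].mean(dim=1), dim=-1)  # (batch, dim)
    logits = view_a @ view_b.t() / temperature                          # (batch, batch)
    targets = torch.arange(batch, device=frame_features.device)
    return F.cross_entropy(logits, targets)


def total_loss(masked_prediction_loss: torch.Tensor,
               frame_features: torch.Tensor,
               alpha: float = 0.1) -> torch.Tensor:
    """Multi-task objective: the SSL masked-prediction loss plus the weighted
    utterance-wise contrastive term (alpha is an assumed hyperparameter)."""
    return masked_prediction_loss + alpha * utterance_contrastive_loss(frame_features)
```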
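
The second method, utterance mixing, creates overlapped speech without speaker labels by adding a scaled segment of another utterance from the same batch over part of the primary waveform. Again a hedged sketch: the function name, maximum overlap ratio, and energy scale below are illustrative assumptions, not the paper's settings.

```python
# Hypothetical sketch of the utterance-mixing augmentation: a randomly chosen
# partner utterance from the same batch is scaled and added over a random
# region of the primary waveform, producing overlapped speech without any
# speaker labels. The overlap ratio and energy scale are assumed values.
import torch


def utterance_mix(waveforms: torch.Tensor,
                  max_overlap_ratio: float = 0.5,
                  energy_scale: float = 0.5) -> torch.Tensor:
    """waveforms: (batch, samples) raw audio. Returns mixed copies."""
    batch, n_samples = waveforms.shape
    mixed = waveforms.clone()
    # Pair each utterance with a randomly permuted partner from the batch.
    partners = waveforms[torch.randperm(batch)]
    for i in range(batch):
        # Overlap at most `max_overlap_ratio` of the primary utterance so the
        # primary speaker remains dominant.
        seg_len = int(torch.randint(1, int(n_samples * max_overlap_ratio) + 1, (1,)))
        start = int(torch.randint(0, n_samples - seg_len + 1, (1,)))
        mixed[i, start:start + seg_len] += energy_scale * partners[i, :seg_len]
    return mixed
```

In this reading, the mixed waveforms would be fed through the same masked-prediction pipeline, so the model has to keep tracking the primary speaker despite the interference, which matches the abstract's motivation of better speaker discrimination.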
Related papers
- SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Why does Self-Supervised Learning for Speech Recognition Benefit Speaker
Recognition? [86.53044183309824]
We study which factors lead to the success of self-supervised learning on speaker-related tasks.
Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size.
arXiv Detail & Related papers (2022-04-27T08:35:57Z)
- Audio Self-supervised Learning: A Survey [60.41768569891083]
Self-Supervised Learning (SSL) aims to discover general representations from large-scale data without requiring human annotations.
Its success in the fields of computer vision and natural language processing has prompted its recent adoption into the field of audio and speech processing.
arXiv Detail & Related papers (2022-03-02T15:58:29Z)
- Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)