WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing
- URL: http://arxiv.org/abs/2110.13900v1
- Date: Tue, 26 Oct 2021 17:55:19 GMT
- Title: WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing
- Authors: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo
Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long
Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei
- Abstract summary: We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
- Score: 102.45426364965887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) achieves great success in speech recognition,
while it has seen only limited exploration for other speech processing tasks.
As the speech signal contains multi-faceted information including speaker identity,
paralinguistics, spoken content, etc., learning universal representations for
all speech tasks is challenging. In this paper, we propose a new pre-trained
model, WavLM, to solve full-stack downstream speech tasks. WavLM is built
on the HuBERT framework, with an emphasis on both spoken content modeling and
speaker identity preservation. We first equip the Transformer structure with
gated relative position bias to improve its capability on recognition tasks.
For better speaker discrimination, we propose an utterance mixing training
strategy, where additional overlapped utterances are created in an unsupervised
manner and incorporated during model training. Lastly, we scale up the training dataset
from 60k hours to 94k hours of public audio data, and optimize its training
procedure for better representation extraction. WavLM Large achieves
state-of-the-art performance on the SUPERB benchmark, and brings significant
improvements for various speech processing tasks on their representative
benchmarks.
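The utterance mixing strategy is described only at a high level above, so the following is a minimal sketch of the idea, assuming that a secondary utterance is cropped, rescaled to a random energy ratio, and overlapped onto a random region of the primary waveform while the prediction targets are still derived from the clean primary utterance. The function name `mix_utterances`, the 50% overlap cap, and the SNR range are illustrative assumptions, not the paper's exact recipe.

```python
import torch


def mix_utterances(primary: torch.Tensor, secondary: torch.Tensor,
                   max_overlap_ratio: float = 0.5,
                   snr_db_range=(-5.0, 5.0)) -> torch.Tensor:
    """Overlap a cropped, rescaled secondary utterance onto the primary one.

    Sketch of the utterance mixing idea from the abstract; max_overlap_ratio
    and snr_db_range are assumed hyperparameters, not values from the paper.
    """
    # Pick how many samples of the secondary utterance to mix in (at most
    # max_overlap_ratio of the primary length, and no more than available).
    mix_len = int(torch.randint(1, int(len(primary) * max_overlap_ratio) + 1, (1,)))
    mix_len = min(mix_len, len(secondary))

    # Random crop of the secondary utterance and random insertion point.
    sec_start = torch.randint(0, len(secondary) - mix_len + 1, (1,)).item()
    pri_start = torch.randint(0, len(primary) - mix_len + 1, (1,)).item()
    chunk = secondary[sec_start:sec_start + mix_len]

    # Rescale the chunk so the primary-to-secondary energy ratio matches a
    # randomly drawn SNR (in dB).
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    pri_energy = primary.pow(2).mean().clamp_min(1e-8)
    sec_energy = chunk.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(pri_energy / (sec_energy * 10.0 ** (snr_db / 10.0)))

    mixed = primary.clone()
    mixed[pri_start:pri_start + mix_len] += scale * chunk
    return mixed


# Example: overlap a 2 s utterance onto a 3 s utterance (16 kHz).
primary = torch.randn(48000)
secondary = torch.randn(32000)
mixed = mix_utterances(primary, secondary)
```

Because the masked-prediction targets would still come from the clean primary utterance, the model has to attend to the primary speaker in the presence of an interfering one, which is what the abstract credits for the improved speaker discrimination.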
Related papers
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural network models for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker
Aware Pre-Training [72.004873454347]
Two methods are introduced to enhance unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.