Speech separation with large-scale self-supervised learning
- URL: http://arxiv.org/abs/2211.05172v1
- Date: Wed, 9 Nov 2022 20:00:21 GMT
- Title: Speech separation with large-scale self-supervised learning
- Authors: Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya
Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez
- Abstract summary: Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments.
We extend the exploration of the SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and fine-tuning data (10K hours).
- Score: 41.96634125460265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) methods such as WavLM have shown promising
speech separation (SS) results in small-scale simulation-based experiments. In
this work, we extend the exploration of the SSL-based SS by massively scaling
up both the pre-training data (more than 300K hours) and fine-tuning data (10K
hours). We also investigate various techniques to efficiently integrate the
pre-trained model with the SS network under a limited computation budget,
including a low frame rate SSL model training setup and a fine-tuning scheme
that uses only part of the pre-trained model. Compared with a supervised
baseline and a WavLM-based SS model that uses feature embeddings from the
previously released WavLM trained on 94K hours, our proposed model obtains
relative word error rate (WER) reductions of 15.9% and 11.2%, respectively,
on a simulated far-field speech mixture test set. For conversation
transcription on real meeting recordings using continuous speech separation,
the proposed model achieves relative WER reductions of 6.8% and 10.6% over
the purely supervised baseline on the AMI and ICSI evaluation sets,
respectively, while reducing the computational cost by 38%.
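As a rough illustration of the integration idea described above, the sketch below reuses only the lower transformer layers of a publicly released WavLM checkpoint (via torchaudio) as the feature extractor for a small mask-estimation head, with a crude frame decimation standing in for the low-frame-rate setup. This is a hedged sketch, not the paper's architecture; the 300K-hour model is not public, and the layer count, frame stride, and head dimensions are assumptions.

```python
# Minimal sketch, not the paper's exact recipe: keep only the lower transformer
# layers of a released WavLM checkpoint as the feature extractor for a small
# mask-estimation head. Layer count, frame stride, and head sizes are assumptions.
import torch
import torch.nn as nn
import torchaudio


class PartialSSLSeparator(nn.Module):
    def __init__(self, num_layers: int = 6, feat_dim: int = 768,
                 num_spk: int = 2, num_freq: int = 257, frame_stride: int = 2):
        super().__init__()
        # Publicly released 94K-hour WavLM (the paper's 300K-hour model is not public).
        self.ssl = torchaudio.pipelines.WAVLM_BASE_PLUS.get_model()
        self.num_layers = num_layers      # use only part of the pre-trained model
        self.frame_stride = frame_stride  # crude stand-in for a lower frame rate
        self.head = nn.Sequential(        # lightweight time-frequency mask estimator
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_spk * num_freq), nn.Sigmoid(),
        )
        self.num_spk, self.num_freq = num_spk, num_freq

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, samples) single-channel mixed speech at 16 kHz
        feats, _ = self.ssl.extract_features(mixture, num_layers=self.num_layers)
        x = feats[-1][:, ::self.frame_stride, :]  # decimate frames to cut compute
        masks = self.head(x)                      # (batch, frames, num_spk * num_freq)
        return masks.view(x.size(0), -1, self.num_spk, self.num_freq)


model = PartialSSLSeparator()
masks = model(torch.randn(1, 16000 * 4))  # four seconds of dummy audio
```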
Related papers
- Training Large ASR Encoders with Differential Privacy [18.624449993983106]
Self-supervised learning (SSL) methods for large speech models have proven to be highly effective at ASR.
With the interest in public deployment of large pre-trained models, there is a rising concern for unintended memorization and leakage of sensitive data points from the training data.
This paper is the first to apply differentially private (DP) pre-training to a SOTA Conformer-based encoder, and study its performance on a downstream ASR task assuming the fine-tuning data is public.
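As a hedged sketch of the core mechanism behind DP training (not the paper's Conformer recipe), one DP-SGD step with per-example gradient clipping and Gaussian noise could look like the following; the clip norm and noise multiplier are illustrative assumptions.

```python
# Minimal DP-SGD sketch, not the paper's Conformer recipe: per-example gradient
# clipping plus Gaussian noise, the core mechanism of DP training. The clip norm
# and noise multiplier are illustrative; assumes every parameter receives a gradient.
import torch
import torch.nn.functional as F


def dp_sgd_step(model, optimizer, batch_x, batch_y, clip_norm=1.0, noise_mult=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):                            # per-example gradients
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
        scale = min(1.0, clip_norm / (grad_norm.item() + 1e-12))  # clip to C
        for s, p in zip(summed, params):
            s.add_(p.grad, alpha=scale)
    for p, s in zip(params, summed):                              # noise, then average
        p.grad = (s + noise_mult * clip_norm * torch.randn_like(s)) / len(batch_x)
    optimizer.step()
```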
arXiv Detail & Related papers (2024-09-21T00:01:49Z)
- On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z)
- Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z)
- Self-Supervised Pretraining Improves Performance and Inference Efficiency in Multiple Lung Ultrasound Interpretation Tasks [65.23740556896654]
We investigated whether self-supervised pretraining could produce a neural network feature extractor applicable to multiple classification tasks in lung ultrasound analysis.
When fine-tuning on three lung ultrasound tasks, pretrained models improved the average across-task area under the receiver operating characteristic curve (AUC) by 0.032 and 0.061 on local and external test sets, respectively.
arXiv Detail & Related papers (2023-09-05T21:36:42Z)
- MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models [90.99663022952498]
SUPERB was proposed to evaluate the generalizability of self-supervised learning (SSL) speech models across various tasks.
However, SUPERB incurs high computational costs due to its large datasets and diverse tasks.
We introduce MiniSUPERB, a lightweight benchmark that evaluates SSL speech models efficiently, achieving results comparable to SUPERB at significantly lower computational cost.
arXiv Detail & Related papers (2023-05-30T13:07:33Z)
- MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module [3.42658286826597]
We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS).
We propose a novel approach where a Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding.
We show that PLDA works well with a non-finetuned SSL model when trained on only 136 utterances.
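A minimal sketch of the general recipe (utterance-level embeddings from a frozen SSL model plus a lightweight predictor fit on a few rated utterances) is shown below; ridge regression is used as a simple stand-in for the paper's PLDA module, and the checkpoint and pooling choices are assumptions.

```python
# Minimal sketch of the general recipe: utterance-level embeddings from a frozen,
# non-finetuned SSL model plus a lightweight predictor fit on a few rated
# utterances. Ridge regression is a simple stand-in for the paper's PLDA module;
# the checkpoint and mean pooling are assumptions.
import torch
import torchaudio
from sklearn.linear_model import Ridge

ssl = torchaudio.pipelines.WAV2VEC2_BASE.get_model().eval()


def embed(waveform: torch.Tensor) -> torch.Tensor:
    # Mean-pool last-layer SSL features into one utterance-level vector.
    with torch.no_grad():
        feats, _ = ssl.extract_features(waveform)
    return feats[-1].mean(dim=1).squeeze(0)


def fit_mos_predictor(waves, mos_scores):
    # waves: list of (1, samples) tensors; mos_scores: listener MOS labels
    X = torch.stack([embed(w) for w in waves]).numpy()
    return Ridge(alpha=1.0).fit(X, mos_scores)
```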
arXiv Detail & Related papers (2023-01-17T18:53:15Z)
- BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z)
- Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models [15.237992590162593]
We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AMs).
We discuss SSL for AMs in a small-footprint setting, showing that a smaller-capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by a 14.3% relative word error rate reduction (WERR).
We then switch to SSL using larger student models in low-data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting.
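As a hedged illustration of the teacher-student idea (not the production pipeline described in the paper), a single distillation step on unlabeled audio might look like the following; model shapes and the temperature are assumptions.

```python
# Hedged teacher-student sketch, not the production pipeline: a large teacher
# acoustic model produces soft posteriors on unlabeled audio and a smaller
# student is trained to match them. Model shapes and temperature are assumptions.
import torch
import torch.nn.functional as F


def distillation_step(teacher, student, optimizer, unlabeled_feats, temperature=2.0):
    # unlabeled_feats: (batch, frames, feat_dim) acoustic features with no transcripts
    with torch.no_grad():
        soft_targets = F.softmax(teacher(unlabeled_feats) / temperature, dim=-1)
    log_probs = F.log_softmax(student(unlabeled_feats) / temperature, dim=-1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```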
arXiv Detail & Related papers (2021-06-11T02:23:40Z)
- Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone [43.77139614544301]
Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we extensively investigate a two-step approach in which we first pre-train a serialized output training (SOT)-based multi-talker ASR model at scale and then fine-tune it on the target meeting data.
With fine-tuning on the 70 hours of AMI-SDM training data, our SOT ASR model achieves a word error rate (WER) of 21.2% on the AMI-SDM evaluation set.
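A minimal sketch of how a serialized output training (SOT) target can be constructed, with transcripts ordered by utterance start time and joined by a speaker-change token (the token name "<sc>" is an assumption), is shown below.

```python
# Minimal sketch of building a serialized output training (SOT) target: speaker
# transcripts are concatenated in order of utterance start time, separated by a
# speaker-change token. The token name "<sc>" is an assumption.
SC = "<sc>"


def sot_target(utterances):
    # utterances: list of (start_time_sec, transcript) pairs for one mixture
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC} ".join(text for _, text in ordered)


print(sot_target([(1.2, "how are you"), (0.3, "hello there")]))
# -> "hello there <sc> how are you"
```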
arXiv Detail & Related papers (2021-03-31T02:43:32Z)