MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets
- URL: http://arxiv.org/abs/2211.07321v3
- Date: Wed, 31 May 2023 11:45:38 GMT
- Title: MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets
- Authors: Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen
- Abstract summary: We provide a new perspective on self-supervised speech models based on how their training targets are obtained.
We propose MT4SSL, a new multi-task learning framework for self-supervised learning.
Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we provide a new perspective on self-supervised speech models based on how their training targets are obtained. We generalize the target extractor into an Offline Targets Extractor (Off-TE) and an Online Targets Extractor (On-TE). Based on this, we propose a new multi-task learning framework for self-supervised learning, MT4SSL, which stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL uses the K-means algorithm as an Off-TE and a gradient-free teacher network as an On-TE. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while using less data. Furthermore, we find that using both Off-TE and On-TE leads to better convergence in the pre-training phase. Given this combination of effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models from this perspective is a promising direction.
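To make the dual-extractor setup concrete, below is a minimal PyTorch-style sketch of how the two targets could be combined in a single loss. This is not the authors' code: the `student`/`ema_teacher` interfaces, the `alpha` weight, and the choice of smooth-L1 regression for the online target are illustrative assumptions; only the overall structure (cross-entropy against offline K-means cluster IDs, regression against a gradient-free teacher) follows the abstract.

```python
import torch
import torch.nn.functional as F

def multi_target_loss(student, ema_teacher, features, mask, kmeans_labels, alpha=1.0):
    """Hypothetical combination of an offline and an online target.

    features:      (B, T, D) input speech features
    mask:          (B, T) boolean mask marking the masked frames
    kmeans_labels: (B, T) cluster IDs from an offline K-means pass (Off-TE)
    """
    # Assumed student API: returns classification logits and hidden states.
    logits, hidden = student(features, mask)

    # Off-TE term: cross-entropy against K-means cluster IDs on masked frames.
    offline_loss = F.cross_entropy(logits[mask], kmeans_labels[mask])

    # On-TE term: regress masked hidden states onto the teacher's outputs.
    # The teacher is gradient-free, so its forward pass runs under no_grad.
    with torch.no_grad():
        targets = ema_teacher(features)  # (B, T, D)
    online_loss = F.smooth_l1_loss(hidden[mask], targets[mask])

    return offline_loss + alpha * online_loss
```

Summing the two terms is the simplest way to realize the multi-task objective; the abstract's observation that using both extractors improves pre-training convergence is what such a combined loss is meant to capture.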
Related papers
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- Mispronunciation detection using self-supervised speech representations
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation
HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task.
We present an unsupervised method to improve SSL targets.
Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training.
arXiv Detail & Related papers (2023-06-15T07:45:12Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Meta-Learning with Self-Improving Momentum Target
We propose Self-improving Momentum Target (SiMT) to improve the performance of a meta-learner.
SiMT generates the target model by adapting from the temporal ensemble of the meta-learner; a minimal sketch of such a momentum-updated target appears after this list.
We show that SiMT brings a significant performance gain when combined with a wide range of meta-learning methods.
arXiv Detail & Related papers (2022-10-11T06:45:15Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
- Sequence-level self-learning with multiple hypotheses
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method reduces the WER on British speech data from 14.55% to 10.36% compared to a baseline model trained only on US English data.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Multilingual Speech Recognition using Knowledge Transfer across Learning Processes
Experimental results show that the best pre-training strategy yields a 3.55% relative reduction in overall WER.
Combining LEAP and SSL yields a 3.51% relative reduction in overall WER when using language ID.
arXiv Detail & Related papers (2021-10-15T07:50:27Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
Two methods are introduced to enhance unsupervised speaker information extraction.
Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale the training dataset up to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
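A recurring ingredient across this list: both MT4SSL's gradient-free teacher (On-TE) and SiMT's temporal-ensemble target maintain the target network as an exponential moving average (EMA) of the learner. Below is a minimal, generic sketch of that update; the function names and the `decay` value are illustrative assumptions, not taken from either paper.

```python
import copy
import torch

def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    # The teacher starts as a frozen copy of the student and never
    # receives gradients of its own.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.999) -> None:
    # teacher <- decay * teacher + (1 - decay) * student
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```

Calling `ema_update` after each optimizer step keeps the teacher a slowly moving temporal ensemble of the student, which is what makes its outputs stable enough to serve as training targets.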