BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning
for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2109.13226v1
- Date: Mon, 27 Sep 2021 17:59:19 GMT
- Title: BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning
for Automatic Speech Recognition
- Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor,
Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li,
Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai
Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng
Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang and Yonghui Wu
- Abstract summary: We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
- Score: 126.5605160882849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We summarize the results of a host of efforts using giant automatic speech
recognition (ASR) models pre-trained using large, diverse unlabeled datasets
containing approximately a million hours of audio. We find that the combination
of pre-training, self-training and scaling up model size greatly increases data
efficiency, even for extremely large tasks with tens of thousands of hours of
labeled data. In particular, on an ASR task with 34k hours of labeled data, by
fine-tuning an 8 billion parameter pre-trained Conformer model we can match
state-of-the-art (SoTA) performance with only 3% of the training data and
significantly improve SoTA with the full training set. We also report on the
universal benefits gained from using big pre-trained and self-trained models
for a large set of downstream tasks that cover a wide range of speech domains
and span multiple orders of magnitude of dataset size, including obtaining
SoTA performance on many public benchmarks. In addition, we utilize the learned
representation of pre-trained networks to achieve SoTA results on non-ASR
tasks.
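The recipe behind these results combines three steps: self-supervised pre-training of a large Conformer encoder on unlabeled audio, self-training in which a teacher model transcribes unlabeled audio into pseudo-labels, and supervised fine-tuning on a (possibly small) labeled set. The sketch below is a minimal, illustrative version of that pipeline only, not the authors' implementation: the tiny GRU encoder, the masking-based pre-training proxy, the greedy pseudo-labeling, and all tensors are placeholders standing in for the paper's Conformer encoders, contrastive pre-training objectives, noisy-student-style self-training, and production-scale data.

```python
# Minimal sketch of the pre-train -> self-train -> fine-tune recipe described in
# the abstract. Every component here is a toy placeholder (random tensors, a tiny
# GRU encoder, a masking-based pre-training proxy); the paper instead uses giant
# Conformer encoders, wav2vec 2.0-style pre-training, noisy-student self-training,
# and roughly a million hours of audio.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 32   # placeholder output vocabulary (index 0 reserved for the CTC blank)
FEAT = 80    # e.g. log-mel filterbank features per frame


class ASRModel(nn.Module):
    """Encoder + CTC head; the GRU stands in for the paper's Conformer encoder."""

    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(FEAT, hidden, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(hidden, VOCAB)

    def forward(self, feats):                      # feats: (B, T, FEAT)
        enc, _ = self.encoder(feats)               # (B, T, hidden)
        return self.ctc_head(enc).log_softmax(-1)  # (B, T, VOCAB) log-probs


def pretrain_step(model, unlabeled_feats, optim):
    """Rough stand-in for self-supervised pre-training: encourage the model to be
    consistent under input masking (the real recipe uses contrastive/BERT-style
    objectives on masked audio features)."""
    masked = unlabeled_feats.clone()
    masked[:, ::4, :] = 0.0                         # zero out every 4th frame
    with torch.no_grad():
        target = model(unlabeled_feats)             # predictions on clean input
    loss = F.kl_div(model(masked), target, log_target=True, reduction="batchmean")
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


def pseudo_label(teacher, unlabeled_feats):
    """Self-training: a teacher model transcribes unlabeled audio. Greedy framewise
    argmax stands in for a proper beam-search CTC decoder."""
    with torch.no_grad():
        return teacher(unlabeled_feats).argmax(-1)  # (B, T) token ids


def finetune_step(model, feats, targets, target_lens, optim):
    """Supervised fine-tuning with a CTC loss on labeled (or pseudo-labeled) data."""
    log_probs = model(feats).transpose(0, 1)        # CTC expects (T, B, VOCAB)
    input_lens = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
    loss = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ASRModel()
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)

    unlabeled = torch.randn(4, 100, FEAT)           # pretend unlabeled audio
    labeled = torch.randn(2, 100, FEAT)             # pretend small labeled subset
    targets = torch.randint(1, VOCAB, (2, 20))      # pretend transcripts
    target_lens = torch.full((2,), 20, dtype=torch.long)

    print("pretrain loss:", pretrain_step(model, unlabeled, optim))
    pseudo = pseudo_label(model, unlabeled)         # here the same toy model plays teacher
    print("finetune loss:", finetune_step(model, labeled, targets, target_lens, optim))
```

In the full recipe described in the abstract, it is the combination of these steps with very large pre-trained Conformer models that drives the reported data efficiency; in this sketch a single small model plays every role purely to keep the example self-contained and runnable.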
Related papers
- On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z)
- Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding [9.112203072394648]
Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow.
Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples.
arXiv Detail & Related papers (2023-12-08T19:26:13Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- Efficient Utilization of Large Pre-Trained Models for Low Resource ASR [31.57758062484189]
We study a challenging low resource conversational telephony speech corpus from the medical domain in Vietnamese and German.
We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models.
arXiv Detail & Related papers (2022-10-26T17:34:30Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
This is done in a completely label-free manner by exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- Domain-matched Pre-training Tasks for Dense Retrieval [68.07140087626637]
Pre-training on larger datasets with ever-increasing model size is now a proven recipe for increased performance across almost all NLP tasks.
We show that, with the right pre-training setup, this barrier can be overcome.
We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations.
arXiv Detail & Related papers (2021-07-28T19:13:00Z)
- Pretraining Representations for Data-Efficient Reinforcement Learning [12.43475487724972]
We use unlabeled data to pretrain an encoder which is then finetuned on a small amount of task-specific data.
When limited to 100k steps of interaction on Atari games, our approach significantly surpasses prior work.
Our approach shows particular promise when combined with larger models as well as more diverse, task-aligned observational data.
arXiv Detail & Related papers (2021-06-09T04:14:27Z)
- Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as few as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the resulting enlarged dataset tractable, we apply a dataset distillation strategy that compresses it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)