Scaling ASR Improves Zero and Few Shot Learning
- URL: http://arxiv.org/abs/2111.05948v1
- Date: Wed, 10 Nov 2021 21:18:59 GMT
- Title: Scaling ASR Improves Zero and Few Shot Learning
- Authors: Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian
Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed
- Abstract summary: We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets.
By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains.
For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively.
- Score: 23.896440724468246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With 4.5 million hours of English speech from 10 different sources across 120
countries and models of up to 10 billion parameters, we explore the frontiers
of scale for automatic speech recognition. We propose data selection techniques
to efficiently scale training data to find the most valuable samples in massive
datasets. To efficiently scale model sizes, we leverage various optimizations
such as sparse transducer loss and model sharding. By training 1-10B parameter
universal English ASR models, we push the limits of speech recognition
performance across many domains. Furthermore, our models learn powerful speech
representations with zero and few-shot capabilities on novel domains and styles
of speech, exceeding previous results across multiple in-house and public
benchmarks. For speakers with disorders due to brain damage, our best zero-shot
and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank
test set, respectively, while realizing the best performance on public social
media videos. Furthermore, the same universal model reaches equivalent
performance with 500x less in-domain data on the SPGISpeech financial-domain
dataset.
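The abstract names data selection as the main lever for scaling the training set to 4.5M hours but does not spell out the recipe here, so the following is a minimal sketch of one generic, score-based approach (confidence- or loss-based filtering under a per-domain hours budget). It is an assumption for illustration, not the authors' implementation; every name in it (Utterance, score_fn, select_training_subset, the hours budget) is hypothetical.

```python
# Illustrative sketch only: generic score-based data selection for a large ASR corpus.
# Not the paper's actual method; all names and the per-domain budget are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    uid: str
    domain: str          # e.g. "video", "assistant", "dictation"
    duration_s: float    # audio length in seconds
    score: float = 0.0   # higher = assumed more valuable for training


def select_training_subset(
    utterances: List[Utterance],
    score_fn: Callable[[Utterance], float],
    hours_budget_per_domain: float,
) -> List[Utterance]:
    """Keep the highest-scoring utterances per domain until an hours budget is hit."""
    by_domain = {}
    for utt in utterances:
        utt.score = score_fn(utt)
        by_domain.setdefault(utt.domain, []).append(utt)

    selected: List[Utterance] = []
    for domain, utts in by_domain.items():
        budget_s = hours_budget_per_domain * 3600.0
        used_s = 0.0
        # Greedily take the highest-scoring audio first, up to the budget.
        for utt in sorted(utts, key=lambda u: u.score, reverse=True):
            if used_s + utt.duration_s > budget_s:
                break
            selected.append(utt)
            used_s += utt.duration_s
    return selected


if __name__ == "__main__":
    import random

    random.seed(0)
    pool = [
        Utterance(uid=f"utt{i}", domain=random.choice(["video", "assistant"]),
                  duration_s=random.uniform(2.0, 15.0))
        for i in range(10_000)
    ]
    # Stand-in scorer: in practice this could be a teacher model's loss,
    # a confidence estimate, or a domain-similarity measure.
    subset = select_training_subset(pool, score_fn=lambda u: random.random(),
                                    hours_budget_per_domain=5.0)
    print(f"kept {len(subset)} of {len(pool)} utterances")
```

A per-domain budget is only one way to keep the training mixture balanced across the 10 sources; the selection criteria actually used are described in the full paper.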
Related papers
- Automatic Speech Recognition for the Ika Language [0.0]
We fine-tune the pretrained wav2vec 2.0 Massively Multilingual Speech models on a high-quality speech dataset compiled from New Testament Bible translations in Ika.
Our results show that fine-tuning multilingual pretrained models achieves a Word Error Rate (WER) of 0.5377 and Character Error Rate (CER) of 0.2651 with just over 1 hour of training data.
arXiv Detail & Related papers (2024-10-01T11:56:42Z)
- SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition [3.4355593397388597]
Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models.
We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models.
We find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER.
arXiv Detail & Related papers (2024-08-14T23:33:10Z)
- Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
- Replay to Remember: Continual Layer-Specific Fine-tuning for German Speech Recognition [19.635428830237842]
We study how well the performance of large-scale ASR models can be approximated for smaller domains.
We apply Experience Replay for continual learning to increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain.
arXiv Detail & Related papers (2023-07-14T11:20:22Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z)
- Scaling End-to-End Models for Large-Scale Multilingual ASR [44.89961662796597]
Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data.
We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours.
arXiv Detail & Related papers (2021-04-30T08:24:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.