SpeechStew: Simply Mix All Available Speech Recognition Data to Train
One Large Neural Network
- URL: http://arxiv.org/abs/2104.02133v1
- Date: Mon, 5 Apr 2021 20:13:36 GMT
- Title: SpeechStew: Simply Mix All Available Speech Recognition Data to Train
One Large Neural Network
- Authors: William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad
Norouzi
- Abstract summary: We present SpeechStew, a speech recognition model that is trained on a combination of publicly available speech recognition datasets.
Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ.
We also demonstrate that SpeechStew learns powerful transfer learning representations.
- Score: 45.59907668722702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SpeechStew, a speech recognition model that is trained on a
combination of various publicly available speech recognition datasets: AMI,
Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and
Wall Street Journal. SpeechStew simply mixes all of these datasets together,
without any special re-weighting or re-balancing of the datasets. SpeechStew
achieves SoTA or near SoTA results across a variety of tasks, without the use
of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7%
WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which
significantly outperforms prior work that uses strong external language models.
We also demonstrate that SpeechStew learns powerful transfer learning
representations. We fine-tune SpeechStew on a noisy low-resource speech
dataset, CHiME-6. We achieve 38.9% WER without a language model, which compares
to 38.6% WER from a strong HMM baseline with a language model.
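The recipe in the abstract is simple enough to sketch in code: pool every utterance from every corpus and shuffle, with no per-dataset weighting anywhere. A minimal sketch, assuming a hypothetical `load_utterances` helper (its name and return format are illustrative, not from the paper):

```python
import random

CORPORA = ["ami", "broadcast_news", "common_voice", "librispeech",
           "switchboard_fisher", "tedlium", "wsj"]

def load_utterances(corpus_name):
    # Placeholder loader: a real one would stream (audio, transcript)
    # pairs from disk rather than build toy tuples.
    return [(f"{corpus_name}_audio_{i}", f"{corpus_name}_text_{i}")
            for i in range(3)]

# "Simply mix": pool every utterance with no re-weighting or re-balancing,
# so larger corpora naturally contribute more examples per batch.
pool = []
for name in CORPORA:
    pool.extend(load_utterances(name))

random.shuffle(pool)  # uniform sampling over the pooled utterances
```

The point of the sketch is that no per-dataset sampling weight appears anywhere; a corpus's share of each batch is simply proportional to its size.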
Related papers
- Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond [36.660499609887886]
Speech-MASSIVE is a multilingual Spoken Language Understanding dataset.
It covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks.
We demonstrate the suitability of Speech-MASSIVE for other tasks such as speech transcription, language identification, and speech translation.
arXiv Detail & Related papers (2024-08-07T16:55:28Z)
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition [0.0]
Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication.
We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment to train a multimodal model with a shared latent representation.
To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER.
arXiv Detail & Related papers (2024-03-02T21:15:24Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
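One way to read "fuses text-based and speech-based language models" is a single decoder over a joint vocabulary of text tokens and discrete audio tokens. A rough sketch of that fusion, where the vocabulary sizes, offset scheme, and names are illustrative assumptions rather than AudioPaLM's actual configuration:

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # pretrained text LM vocabulary (illustrative size)
AUDIO_VOCAB = 1_024   # discrete acoustic units from a speech tokenizer (assumed)
D_MODEL = 512

# One embedding table over the joint vocabulary: text ids occupy
# [0, TEXT_VOCAB); audio ids shift into [TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB).
embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)

def to_joint_ids(text_ids, audio_ids):
    """Map modality-specific ids into the shared id space."""
    return torch.cat([text_ids, audio_ids + TEXT_VOCAB], dim=-1)

# A single decoder over these embeddings can consume and emit either modality,
# which is what enables ASR and speech-to-speech translation in one model.
tokens = to_joint_ids(torch.tensor([[5, 9]]), torch.tensor([[3, 7, 7]]))
hidden = embed(tokens)  # shape: (1, 5, D_MODEL)
```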
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion [0.0]
We develop a novel approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, to predict speech acts.
We name this model BeAts (Bengali speech acts recognition using Multimodal Attention Fusion).
arXiv Detail & Related papers (2023-06-05T08:12:17Z)
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization [61.60501633397704]
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SoTA supervised models on some datasets.
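As a concrete illustration of manipulating Whisper's special tokens, here is a sketch using the Hugging Face port of Whisper; the paper's actual prompt designs differ, so the language/task choice below is only an example:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Whisper's decoder is steered by special tokens such as <|en|>, <|transcribe|>,
# and <|translate|>; swapping them re-purposes the frozen model for a new task.
prompt_ids = processor.get_decoder_prompt_ids(language="german", task="translate")

# Dummy 30-second input (80 mel bins x 3000 frames) standing in for real audio.
input_features = torch.zeros(1, 80, 3000)
generated = model.generate(input_features, forced_decoder_ids=prompt_ids)
print(processor.batch_decode(generated, skip_special_tokens=True))
```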
arXiv Detail & Related papers (2023-05-18T16:32:58Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
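The masked-prediction objective over unified tokens can be sketched in a few lines; the backbone, dimensions, and masking scheme below are placeholders rather than VATLM's actual design:

```python
import torch
import torch.nn as nn

VOCAB, D = 1000, 256  # unified token vocabulary and model width (illustrative)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Linear(D, VOCAB)

def masked_prediction_loss(frames, unified_tokens, mask_prob=0.15):
    """frames: (B, T, D) features from any modality, mapped to a shared space;
    unified_tokens: (B, T) target ids in the shared token space."""
    B, T, _ = frames.shape
    mask = torch.rand(B, T) < mask_prob                       # positions to corrupt
    corrupted = frames.masked_fill(mask.unsqueeze(-1), 0.0)   # zero masked frames
    logits = head(backbone(corrupted))                        # (B, T, VOCAB)
    # Only the masked positions contribute to the loss.
    return nn.functional.cross_entropy(logits[mask], unified_tokens[mask])

loss = masked_prediction_loss(torch.randn(2, 50, D),
                              torch.randint(0, VOCAB, (2, 50)))
```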
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition systems for 1909 languages by combining our pipeline with Crubadan, a large n-gram database of endangered languages.
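A toy example of how the three components compose without any target-language audio, using synthetic data in place of the paper's universal phone recognizer, pronunciation lexicon, and Crubadan-based n-gram model:

```python
# Toy composition: acoustic model output (phones) -> lexicon -> LM rescoring.
# All data here is synthetic; the real pipeline uses a universal phone
# recognizer plus n-gram statistics from text corpora such as Crubadan.

phones = ["h", "o", "l", "a"]                # pretend acoustic-model output

lexicon = {                                  # pronunciation model: word -> phones
    "hola": ["h", "o", "l", "a"],
    "ola":  ["o", "l", "a"],
}
lm_logprob = {"hola": -0.5, "ola": -2.0}     # unigram LM built from text alone

def score(word):
    """Combine a crude acoustic match with the LM score (higher is better)."""
    pron = lexicon[word]
    match = sum(p == q for p, q in zip(pron, phones)) / len(phones)
    return match + lm_logprob[word]

print(max(lexicon, key=score))               # -> "hola"
```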
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
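For reference, a compact sketch of the contrastive objective wav2vec 2.0 solves over masked latents; unlike the paper, which samples a fixed set of distractors, this sketch treats every other time step as a negative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, masked_idx, temperature=0.1):
    """InfoNCE over masked latents, in the spirit of wav2vec 2.0.
    context:    (T, D) transformer outputs
    quantized:  (T, D) quantized latent targets
    masked_idx: (M,) indices of the masked time steps"""
    c = F.normalize(context[masked_idx], dim=-1)   # (M, D) predictions
    q = F.normalize(quantized, dim=-1)             # (T, D) candidates
    # Similarity of each prediction to every candidate; the entry at the same
    # time step is the positive, all other steps act as distractors.
    logits = c @ q.t() / temperature               # (M, T)
    return F.cross_entropy(logits, masked_idx)

T, D = 100, 64
loss = contrastive_loss(torch.randn(T, D), torch.randn(T, D),
                        torch.tensor([3, 17, 42]))
```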
arXiv Detail & Related papers (2020-06-24T18:25:05Z)