Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
- URL: http://arxiv.org/abs/2210.10027v1
- Date: Tue, 18 Oct 2022 17:50:31 GMT
- Title: Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
- Authors: Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen
- Abstract summary: We show that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised speech for some languages.
We show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap.
- Score: 39.59611707268663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training state-of-the-art Automatic Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and using language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages, we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages to below 15%.
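As a rough, hypothetical illustration of the idea described above (combining speech representations with byte-level text representations and language embeddings), a shared encoder that accepts both modalities might look like the sketch below; module names and dimensions are illustrative and not taken from the paper.
```python
# Hypothetical sketch: byte-level text tokens and a language ID are embedded into the
# same space as speech-encoder frames, so one shared encoder can consume both
# modalities. Names and dimensions are illustrative, not from Maestro-U.
import torch
import torch.nn as nn

class SharedSpeechTextEncoder(nn.Module):
    def __init__(self, d_model=256, n_langs=102):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)      # byte-level text vocabulary (0-255)
        self.lang_emb = nn.Embedding(n_langs, d_model)  # one embedding per language
        self.shared = nn.GRU(d_model, d_model, batch_first=True)  # stand-in shared encoder

    def forward(self, frames_or_bytes, lang_id, is_text):
        # Text path: embed bytes; speech path: frames are assumed to already lie in d_model space.
        x = self.byte_emb(frames_or_bytes) if is_text else frames_or_bytes
        x = x + self.lang_emb(lang_id)[:, None, :]       # add language embedding to every frame/token
        out, _ = self.shared(x)
        return out

enc = SharedSpeechTextEncoder()
speech = torch.randn(2, 50, 256)               # pretend speech-encoder output: (batch, frames, dim)
text = torch.randint(0, 256, (2, 12))          # UTF-8 bytes of unspoken text
lang = torch.tensor([3, 7])                    # language IDs
print(enc(speech, lang, is_text=False).shape)  # torch.Size([2, 50, 256])
print(enc(text, lang, is_text=True).shape)     # torch.Size([2, 12, 256])
```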
Related papers
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
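A minimal sketch of what such a composite arrangement could look like, assuming a generic adapter that bridges a pretrained speech encoder and a pretrained text decoder (placeholder modules, not ComSL's actual components):
```python
# Hypothetical sketch of a "composite" model: a pretrained speech encoder feeds a small
# adapter that maps its features into the embedding space of a pretrained text decoder.
import torch
import torch.nn as nn

speech_encoder = nn.GRU(80, 512, batch_first=True)  # stand-in for a pretrained speech model
adapter = nn.Linear(512, 768)                        # bridges speech features to the LM dimension
text_decoder = nn.GRU(768, 768, batch_first=True)    # stand-in for a pretrained language model

mel = torch.randn(1, 200, 80)                        # 200 frames of 80-dim log-mel features
h, _ = speech_encoder(mel)                           # (1, 200, 512)
dec_in = adapter(h)                                  # (1, 200, 768), now in the LM's space
out, _ = text_decoder(dec_in)                        # decoder consumes the adapted speech features
print(out.shape)                                     # torch.Size([1, 200, 768])
```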
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition [51.412413996510814]
We propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks.
MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2.
arXiv Detail & Related papers (2023-03-09T14:58:29Z)
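A minimal sketch of cross-modal mixup under generic assumptions (Beta-sampled coefficient, time-aligned streams); MixSpeech's exact formulation may differ.
```python
# Hypothetical illustration of cross-modal mixup: interpolate audio and visual frame
# embeddings with a Beta-sampled coefficient. A generic mixup recipe, not MixSpeech's
# exact training objective.
import torch

torch.manual_seed(0)
audio_emb = torch.randn(4, 100, 256)    # (batch, frames, dim) audio-stream features
visual_emb = torch.randn(4, 100, 256)   # time-aligned visual-stream (lip) features

lam = torch.distributions.Beta(0.5, 0.5).sample()   # mixing coefficient in [0, 1]
mixed = lam * audio_emb + (1.0 - lam) * visual_emb  # mixed input regularizes the visual branch
print(lam.item(), mixed.shape)
```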
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
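A minimal sketch of random-projection quantization, assuming a frozen random projection and a frozen random codebook that turn each frame into a discrete pre-training target (illustrative shapes and an L2 nearest-neighbor rule, not USM's exact configuration):
```python
# Hypothetical sketch of random-projection quantization for speech pre-training:
# project each frame with a frozen random matrix and take the nearest entry of a
# frozen random codebook as its discrete target label.
import torch

torch.manual_seed(0)
feat_dim, proj_dim, codebook_size = 80, 16, 8192
projection = torch.randn(feat_dim, proj_dim)     # frozen random projection
codebook = torch.randn(codebook_size, proj_dim)  # frozen random codebook

frames = torch.randn(200, feat_dim)              # 200 speech frames of 80-dim features
projected = frames @ projection                  # (200, 16)
dists = torch.cdist(projected, codebook)         # (200, 8192) pairwise distances
targets = dists.argmin(dim=-1)                   # discrete pre-training target per frame
print(targets.shape, targets[:5])
```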
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- MAESTRO: Matched Speech Text Representations through Modality Matching [35.566604806335626]
Maestro is a self-supervised training method to unify representations learnt from speech and text modalities.
We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 11% relative reduction in Word Error Rate (WER).
We establish a new state-of-the-art (SOTA) on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages.
arXiv Detail & Related papers (2022-04-07T12:48:16Z)
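A minimal sketch of the modality-matching idea, assuming oracle token durations and a simple L2 consistency loss (Maestro's learned aligner and RNN-T objectives are not shown):
```python
# Hypothetical sketch of modality matching: upsample token embeddings to the speech
# frame rate using per-token durations, then tie them to speech-encoder frames with
# an L2 consistency loss.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 256
speech_frames = torch.randn(1, 10, d)     # speech-encoder output, 10 frames
token_emb = torch.randn(1, 4, d)          # embeddings of 4 text tokens
durations = torch.tensor([[3, 2, 4, 1]])  # frames per token (sums to 10)

# Repeat each token embedding according to its duration -> frame-rate text sequence.
upsampled = token_emb.repeat_interleave(durations[0], dim=1)  # (1, 10, d)
consistency_loss = F.mse_loss(upsampled, speech_frames)       # pulls the two modalities together
print(upsampled.shape, consistency_loss.item())
```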
- CLSRIL-23: Cross Lingual Speech Representations for Indic Languages [0.0]
CLSRIL-23 is a self-supervised learning based model which learns cross-lingual speech representations from raw audio across 23 Indic languages.
It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
We compare the language-wise loss during pretraining to study the effects of monolingual and multilingual pretraining.
arXiv Detail & Related papers (2021-07-15T15:42:43Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
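A minimal sketch of such a contrastive task over masked latent frames, with randomly sampled distractors and a cosine-similarity InfoNCE-style loss (simplified relative to wav2vec 2.0's masking and quantization):
```python
# Hypothetical sketch: for each masked position, the context vector must identify the
# true (quantized) latent among distractors drawn from other positions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, k = 50, 256, 10                    # frames, dim, distractors per frame
context = torch.randn(T, d)              # Transformer outputs at masked positions
targets = torch.randn(T, d)              # true latent representations at those positions

loss = 0.0
for t in range(T):
    neg_idx = torch.randint(0, T, (k,))                                  # distractor latents
    candidates = torch.cat([targets[t:t + 1], targets[neg_idx]], dim=0)  # true latent first
    logits = F.cosine_similarity(context[t:t + 1], candidates) / 0.1     # temperature 0.1
    loss = loss + F.cross_entropy(logits[None, :], torch.tensor([0]))    # true latent = class 0
print((loss / T).item())
```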
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.