XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
- URL: http://arxiv.org/abs/2111.09296v2
- Date: Fri, 19 Nov 2021 01:34:26 GMT
- Title: XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
- Authors: Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong
Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan
Pino, Alexei Baevski, Alexis Conneau, Michael Auli
- Abstract summary: XLS-R is a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0.
We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages.
- Score: 48.0390317915984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech
representation learning based on wav2vec 2.0. We train models with up to 2B
parameters on nearly half a million hours of publicly available speech audio in
128 languages, an order of magnitude more public data than the largest known
prior work. Our evaluation covers a wide range of tasks, domains, data regimes
and languages, both high and low-resource. On the CoVoST-2 speech translation
benchmark, we improve the previous state of the art by an average of 7.4 BLEU
over 21 translation directions into English. For speech recognition, XLS-R
improves over the best known prior work on BABEL, MLS, CommonVoice as well as
VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets
a new state of the art on VoxLingua107 language identification. Moreover, we
show that with sufficient model size, cross-lingual pretraining can outperform
English-only pretraining when translating English speech into other languages,
a setting which favors monolingual pretraining. We hope XLS-R can help to
improve speech processing tasks for many more languages of the world.
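As a practical illustration, the sketch below extracts frame-level XLS-R representations with the Hugging Face transformers library. It is a minimal sketch, not the authors' training code; the checkpoint name facebook/wav2vec2-xls-r-300m is assumed (the 1B and 2B variants would follow the same pattern).

```python
# Minimal sketch: extracting XLS-R speech representations via the
# Hugging Face transformers library. The checkpoint id is an assumption;
# larger variants (1B/2B parameters) would be loaded the same way.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-xls-r-300m"  # assumed checkpoint id
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2Model.from_pretrained(checkpoint)
model.eval()

# One second of 16 kHz mono audio; silence stands in for real speech.
waveform = np.zeros(16000, dtype=np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state

# Shape (batch, frames, hidden): roughly one vector per 20 ms of audio.
print(features.shape)
```

For downstream speech recognition, such a checkpoint is typically fine-tuned with a CTC head (e.g. Wav2Vec2ForCTC in the same library) on labeled transcriptions.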
Related papers
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model [16.31307448314024]
We propose DistilXLSR, a distilled cross-lingual speech representation model.
By randomly shuffling the phonemes of existing speech, we reduce the linguistic information and distill cross-lingual models using only English data.
Our method is shown to generalize across languages and teacher models and has the potential to improve the cross-lingual performance of English pre-trained models.
arXiv Detail & Related papers (2023-06-02T07:03:06Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
- CLSRIL-23: Cross Lingual Speech Representations for Indic Languages [0.0]
CLSRIL-23 is a self-supervised model that learns cross-lingual speech representations from raw audio across 23 Indic languages.
It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
We compare the language-wise loss during pretraining to study the effects of monolingual versus multilingual pretraining.
arXiv Detail & Related papers (2021-07-15T15:42:43Z)
- Multilingual Speech Translation with Efficient Finetuning of Pretrained Models [82.22294901727933]
Minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot cross-lingual and cross-modality transfer.
Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model.
arXiv Detail & Related papers (2020-10-24T08:15:08Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations (the objective is sketched just after this list).
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
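Several of the models above, including XLS-R, XLSR, and CLSRIL-23, share the wav2vec 2.0 pretraining objective: a contrastive task over masked latent speech representations. As a brief sketch following the wav2vec 2.0 paper, the loss at a masked time step $t$ is

$$ \mathcal{L}_m = -\log \frac{\exp\!\left(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\!\left(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)} $$

where $\mathbf{c}_t$ is the Transformer output at the masked step, $\mathbf{q}_t$ is the true quantized latent, $\mathbf{Q}_t$ contains $\mathbf{q}_t$ plus $K$ distractors sampled from other masked steps of the same utterance, $\mathrm{sim}$ denotes cosine similarity, and $\kappa$ is a temperature.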
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.