Unsupervised Cross-lingual Representation Learning for Speech
Recognition
- URL: http://arxiv.org/abs/2006.13979v2
- Date: Tue, 15 Dec 2020 23:19:19 GMT
- Title: Unsupervised Cross-lingual Representation Learning for Speech
Recognition
- Authors: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed,
Michael Auli
- Abstract summary: XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents XLSR which learns cross-lingual speech representations by
pretraining a single model from the raw waveform of speech in multiple
languages. We build on wav2vec 2.0 which is trained by solving a contrastive
task over masked latent speech representations and jointly learns a
quantization of the latents shared across languages. The resulting model is
fine-tuned on labeled data and experiments show that cross-lingual pretraining
significantly outperforms monolingual pretraining. On the CommonVoice
benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared
to the best known results. On BABEL, our approach improves word error rate by
16% relative compared to a comparable system. Our approach enables a single
multilingual speech recognition model which is competitive with strong individual
models. Analysis shows that the latent discrete speech representations are
shared across languages with increased sharing for related languages. We hope
to catalyze research in low-resource speech understanding by releasing XLSR-53,
a large model pretrained in 53 languages.
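As a minimal sketch of the contrastive pretraining task described above (tensor shapes, the distractor-sampling scheme, and the temperature are assumptions, not the released fairseq implementation), written in PyTorch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, distractor_ids, temperature=0.1):
    # context:        (B, T, D) transformer outputs at masked time steps
    # quantized:      (B, T, D) quantized latent targets at the same steps
    # distractor_ids: (B, T, K) time indices of K distractors sampled from
    #                 other masked positions in the same utterance
    B, T, D = context.shape
    batch_idx = torch.arange(B, device=context.device).view(B, 1, 1)
    distractors = quantized[batch_idx, distractor_ids]           # (B, T, K, D)
    # Candidate 0 is the true quantized target; the rest are distractors.
    candidates = torch.cat([quantized.unsqueeze(2), distractors], dim=2)
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1)
    logits = logits / temperature                                # (B, T, K+1)
    targets = torch.zeros(B * T, dtype=torch.long, device=context.device)
    return F.cross_entropy(logits.reshape(B * T, -1), targets)
```

The quantized targets come from the codebook that, per the analysis above, is shared across languages, with more sharing between related languages.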
Related papers
- A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives [2.3592914313389257]
We compare monolingual Wav2Vec 2.0 models with various multilingual models to see whether multilingual pretraining improves speech recognition performance.
Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models.
arXiv Detail & Related papers (2024-07-24T11:03:47Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
With limited training data, finetuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
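As a concrete illustration of this kind of finetuning (a sketch, not the paper's recipe), the XLSR-53 checkpoint released with the main paper can be adapted to a new language with a freshly initialized CTC head via Hugging Face Transformers; the vocabulary size is a placeholder:

```python
from transformers import Wav2Vec2ForCTC

vocab_size = 40  # hypothetical character set of the target language

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # multilingual pretrained checkpoint
    vocab_size=vocab_size,              # new, randomly initialized CTC head
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice: keep the CNN encoder frozen
# The model can now be finetuned with CTC loss on (audio, transcript) pairs.
```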
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but that, with the amount of data held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
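A rough sketch of the reprogramming idea (module shapes and placement are assumptions; the paper's auxiliary architectures differ in detail): the pretrained ASR backbone stays frozen, and only a small input-side transform is trained for the new language.

```python
import torch.nn as nn

class ReprogrammedASR(nn.Module):
    def __init__(self, backbone, feat_dim=80):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # the pretrained ASR model stays frozen
        # Lightweight trainable transform that enhances the input features.
        self.delta = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats):  # feats: (batch, feat_dim, time)
        # Only self.delta receives gradients during cross-lingual adaptation.
        return self.backbone(feats + self.delta(feats))
```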
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
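The retrieval step itself reduces to scoring pooled speech embeddings against pooled image embeddings; a minimal sketch (encoder choice and pooling are left abstract):

```python
import torch.nn.functional as F

def retrieval_scores(speech_emb, image_emb):
    # speech_emb: (N, D) pooled speech-encoder outputs (e.g., from HuBERT)
    # image_emb:  (M, D) pooled image-encoder outputs (e.g., from CLIP)
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    return s @ v.T  # (N, M) cosine similarities; rank images per utterance
```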
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded in a well-trained teacher text model into a student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
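A generic knowledge-distillation loss of the kind such teacher-student setups rely on (a sketch; the paper's exact objective and teacher-student alignment are not reproduced here):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # The speech student is trained to match the softened output
    # distribution of the well-trained text teacher.
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t
```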
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- CLSRIL-23: Cross Lingual Speech Representations for Indic Languages [0.0]
CLSRIL-23 is a self-supervised model which learns cross-lingual speech representations from raw audio across 23 Indic languages.
It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
We compare the language-wise loss during pretraining to study the effects of monolingual versus multilingual pretraining.
arXiv Detail & Related papers (2021-07-15T15:42:43Z)
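The language-wise loss comparison can be reproduced in spirit by averaging the logged pretraining loss per language; a trivial sketch with fabricated values:

```python
from collections import defaultdict

def language_wise_loss(records):
    # records: iterable of (language_code, loss_value) pairs logged during
    # pretraining; returns the mean pretraining loss per language.
    sums, counts = defaultdict(float), defaultdict(int)
    for lang, loss in records:
        sums[lang] += loss
        counts[lang] += 1
    return {lang: sums[lang] / counts[lang] for lang in sums}

# Example with made-up values for two Indic languages:
print(language_wise_loss([("hi", 2.1), ("hi", 1.9), ("ta", 2.6), ("ta", 2.4)]))
```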