DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model
- URL: http://arxiv.org/abs/2306.01303v1
- Date: Fri, 2 Jun 2023 07:03:06 GMT
- Title: DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model
- Authors: Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Jinfeng Bai
- Abstract summary: We propose DistilXLSR, a distilled cross-lingual speech representation model.
By randomly shuffling the phonemes of existing speech, we reduce the linguistic information and distill cross-lingual models using only English data.
Our method is proven to be generalizable to various languages/teacher models and has the potential to improve the cross-lingual performance of the English pre-trained models.
- Score: 16.31307448314024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual self-supervised speech representation models have greatly
enhanced the speech recognition performance for low-resource languages, and the
compression of these huge models has also become a crucial prerequisite for
their industrial application. In this paper, we propose DistilXLSR, a distilled
cross-lingual speech representation model. By randomly shuffling the phonemes
of existing speech, we reduce the linguistic information and distill
cross-lingual models using only English data. We also design a layer-jumping
initialization method to fully leverage the teacher's pre-trained weights.
Experiments on 2 kinds of teacher models and 15 low-resource languages show
that our method can reduce the parameters by 50% while maintaining
cross-lingual representation ability. Our method is proven to be generalizable
to various languages/teacher models and has the potential to improve the
cross-lingual performance of the English pre-trained models.
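The abstract names two concrete techniques: shuffling the phonemes of existing English speech to strip linguistic information before distillation, and a layer-jumping initialization that reuses the teacher's pre-trained weights. Below is a minimal Python sketch of both ideas; the forced-alignment boundaries, function names, and torch-style layer API are assumptions for illustration, not details taken from the paper.

```python
import random
import numpy as np

def shuffle_phonemes(waveform, boundaries, seed=None):
    """Cut a waveform at phoneme boundaries and shuffle the segments.

    `boundaries` is a list of (start, end) sample indices, e.g. from a
    forced aligner; shuffling destroys most linguistic content while
    keeping speaker and acoustic characteristics.
    """
    rng = random.Random(seed)
    segments = [waveform[s:e] for s, e in boundaries]
    rng.shuffle(segments)
    return np.concatenate(segments)

def layer_jumping_init(student_layers, teacher_layers):
    """Initialize a shallower student from every k-th teacher layer.

    Both arguments are lists of torch.nn.Module transformer layers.
    With a 24-layer teacher and a 12-layer student, the stride is 2,
    so teacher layers 0, 2, 4, ... seed student layers 0, 1, 2, ...
    """
    stride = len(teacher_layers) // len(student_layers)
    for i, layer in enumerate(student_layers):
        layer.load_state_dict(teacher_layers[i * stride].state_dict())
```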
Related papers
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models trained on more data outperform monolingual ones, but that, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we surpass the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
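As a rough illustration of what image-speech retrieval evaluation involves, the sketch below ranks images against speech queries by cosine similarity over precomputed embeddings (e.g., CLIP image embeddings and pooled HuBERT states) and reports recall@k. The function name and the assumption of precomputed, L2-normalized embeddings are mine, not from the paper.

```python
import numpy as np

def recall_at_k(speech_emb, image_emb, k=1):
    """Speech-to-image retrieval scored by cosine similarity.

    Both inputs are (N, D) arrays of L2-normalized embeddings where
    row i of `speech_emb` describes row i of `image_emb`.
    """
    sims = speech_emb @ image_emb.T          # (N, N) cosine similarities
    ranks = np.argsort(-sims, axis=1)        # best-matching image first
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()
```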
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded in a well-trained teacher text model into the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
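The summary does not give the exact objective, but a common way to distill a text teacher into a speech student is to match hidden representations through a learned projection. A hedged sketch, assuming the student and teacher sequences have already been aligned to the same length (the length mismatch between speech frames and text tokens is glossed over here):

```python
import torch.nn.functional as F

def distillation_loss(student_hidden, teacher_hidden, projector):
    """Match speech-student hidden states to text-teacher hidden states.

    `projector` (e.g., a torch.nn.Linear) maps the student dimension to
    the teacher dimension; an L1 term plus a cosine term is a common
    choice for representation-level distillation.
    """
    pred = projector(student_hidden)                       # (B, T, D_teacher)
    l1 = F.l1_loss(pred, teacher_hidden)
    cos = 1.0 - F.cosine_similarity(pred, teacher_hidden, dim=-1).mean()
    return l1 + cos
```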
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Cross-lingual Visual Pre-training for Multimodal Machine Translation [36.4592103797139]
We combine cross-lingual and visual pre-training methods to learn cross-lingual representations.
We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance.
arXiv Detail & Related papers (2021-01-25T12:46:41Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
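As a hedged sketch of what a sentence-level alignment regularizer can look like (the paper's exact word- and sentence-level formulation is not reproduced here), one can penalize the distance between mean-pooled encoder states of paired utterances in two languages:

```python
import torch.nn.functional as F

def alignment_regularizer(src_hidden, tgt_hidden, src_mask, tgt_mask):
    """Pull sentence representations of two languages together.

    Hidden states are (B, T, D) tensors, masks are (B, T) with 1 for
    valid tokens. Mean-pools over valid tokens and penalizes the mean
    squared distance between the pooled vectors; a simple stand-in for
    the paper's alignment terms.
    """
    def pool(h, m):
        m = m.unsqueeze(-1).float()
        return (h * m).sum(1) / m.sum(1).clamp(min=1.0)

    return F.mse_loss(pool(src_hidden, src_mask), pool(tgt_hidden, tgt_mask))
```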
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
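A minimal sketch of cross-lingual contrastive learning on parallel sentences, in the spirit of InfoNCE rather than InfoXLM's exact objective; the encoder outputs and function name are assumptions:

```python
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_vecs, tgt_vecs, temperature=0.1):
    """InfoNCE over parallel sentence pairs.

    Row i of `src_vecs` and `tgt_vecs` are (B, D) encodings of a
    translation pair; the other rows in the batch act as negatives.
    """
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    logits = src @ tgt.t() / temperature                   # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```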
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
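For reference, the contrastive task that wav2vec 2.0 and XLSR solve can be sketched as follows: at each masked time step, the transformer output must identify the true quantized latent among negatives drawn from other masked steps. This is a simplified illustration under assumed shapes and sampling, not the fairseq implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, mask, num_negatives=100, temperature=0.1):
    """Contrastive task over masked latent speech representations.

    `context` and `quantized` are (B, T, D) tensors, `mask` is a (B, T)
    boolean tensor marking masked time steps. Negatives are sampled
    uniformly from other masked steps (simplified).
    """
    c = context[mask]                       # (M, D) outputs at masked steps
    q = quantized[mask]                     # (M, D) true quantized targets
    m = c.size(0)
    neg_idx = torch.randint(0, m, (m, num_negatives), device=c.device)
    candidates = torch.cat([q.unsqueeze(1), q[neg_idx]], dim=1)   # (M, 1+K, D)
    logits = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(m, dtype=torch.long, device=c.device)    # true target at index 0
    return F.cross_entropy(logits, labels)
```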
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.