Learning Cross-lingual Visual Speech Representations
- URL: http://arxiv.org/abs/2303.09455v1
- Date: Tue, 14 Mar 2023 17:05:08 GMT
- Title: Learning Cross-lingual Visual Speech Representations
- Authors: Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic
- Abstract summary: Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
- Score: 108.68531445641769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-lingual self-supervised learning has been a growing research topic in
the last few years. However, current works only explored the use of audio
signals to create representations. In this work, we study cross-lingual
self-supervised visual representation learning. We use the recently-proposed
Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual
model with unlabelled multilingual data, and then fine-tune the visual model on
labelled transcriptions. Our experiments show that: (1) multi-lingual models
with more data outperform monolingual ones, but, when keeping the amount of
data fixed, monolingual models tend to reach better performance; (2)
multi-lingual outperforms English-only pre-training; (3) using languages which
are more similar yields better results; and (4) fine-tuning on unseen languages
is competitive to using the target language in the pre-training set. We hope
our study inspires future research on non-English-only speech representation
learning.
Related papers
- ViSpeR: Multilingual Audio-Visual Speech Recognition [9.40993779729177]
This work presents an extensive and detailed study on Audio-Visual Speech Recognition for five widely spoken languages.
We have collected large-scale datasets for each language except for English, and have engaged in the training of supervised learning models.
Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language.
arXiv Detail & Related papers (2024-05-27T14:48:51Z)
- Multilingual Turn-taking Prediction Using Voice Activity Projection [25.094622033971643]
This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data.
The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages.
A multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages.
arXiv Detail & Related papers (2024-03-11T07:50:29Z)
- Adapting Multilingual Speech Representation Model for a New, Underresourced Language through Multilingual Fine-tuning and Continued Pretraining [2.3513645401551333]
We investigate the possibility for adapting an existing multilingual wav2vec 2.0 model for a new language.
Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language.
We find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have positive impact on speech recognition performance.
arXiv Detail & Related papers (2023-01-18T03:57:53Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext to improve the cross-lingual transferability of pre-trained models.
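Contrastive pre-training objectives of this kind are typically InfoNCE-style losses: a query representation is pulled toward its positive (e.g. a parallel sentence in another language) and pushed away from in-batch negatives. A minimal pure-Python sketch for illustration (the function names and the choice of cosine similarity are assumptions, not taken from the InfoXLM paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE contrastive loss: -log softmax of the positive's
    similarity against the positive plus all negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[0]  # low when query matches its positive
```

When the query and positive are aligned (e.g. translations of the same sentence) the loss approaches zero; when a negative is more similar than the positive, the loss grows, which is what drives cross-lingual alignment during pre-training.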
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
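The masking step behind wav2vec 2.0's contrastive task can be sketched as span masking over latent frames: each frame is chosen as a span start with some probability, and a fixed-length span from it is masked; the model must then identify the true quantised latent for each masked frame among distractors. A simplified pure-Python sketch (the defaults mirror wav2vec 2.0's reported settings, but details such as enforcing a minimum number of spans are omitted):

```python
import random

def sample_mask(num_frames, mask_prob=0.065, mask_span=10, seed=0):
    """Span masking over latent frames: each frame starts a masked
    span with probability mask_prob; spans may overlap. Returns the
    sorted indices of masked frames."""
    rng = random.Random(seed)
    masked = set()
    for t in range(num_frames):
        if rng.random() < mask_prob:
            masked.update(range(t, min(t + mask_span, num_frames)))
    return sorted(masked)
```

With these defaults roughly half of the frames end up masked on average, so the contrastive objective sees plenty of positions per utterance while leaving enough unmasked context to condition on.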
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.