Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models
- URL: http://arxiv.org/abs/2509.17523v1
- Date: Mon, 22 Sep 2025 08:48:04 GMT
- Title: Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models
- Authors: María Andrea Cruz Blandón, Zakaria Aldeneh, Jie Chi, Maureen de Seyssel,
- Abstract summary: Self-supervised learning (SSL) has made significant advances in speech representation learning.<n> multilingual SSL models tend to underperform their monolingual counterparts on each individual language.<n>We introduce limited visual grounding into bilingual speech SSL models to reduce this performance gap.
- Score: 8.556170587475146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In this work, we investigate a novel approach to reduce this performance gap by introducing limited visual grounding into bilingual speech SSL models. Our results show that visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter, reducing the multilingual performance gap on zero-shot phonetic discrimination from 31.5% for audio-only models to 8.04% with grounding.
Related papers
- Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech
Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
arXiv Detail & Related papers (2023-11-02T08:37:30Z) - Hindi as a Second Language: Improving Visually Grounded Speech with
Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech models (VGS) from multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z) - Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speechs (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for
Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z) - Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.