Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context
- URL: http://arxiv.org/abs/2404.02000v3
- Date: Mon, 22 Apr 2024 09:18:44 GMT
- Title: Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context
- Authors: Antoine Caubrière, Elodie Gauthier
- Abstract summary: We present the first self-supervised multilingual speech model trained exclusively on African speech.
The model learned from nearly 60,000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa.
- Score: 2.3066058341851816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first self-supervised multilingual speech model trained exclusively on African speech. The model learned from nearly 60,000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa. On the SSA subset of the FLEURS-102 dataset, our approach, based on a HuBERT$_{base}$ (0.09B) architecture, shows competitive results on the ASR downstream task compared to the w2v-bert-51 (0.6B) pre-trained model proposed in the FLEURS benchmark, while being more efficient through the use of 7x less data and 6x fewer parameters. Furthermore, on a LID downstream task, our approach outperforms the FLEURS baseline accuracy by over 22%.
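To make the downstream recipe concrete, here is a minimal sketch of extracting frame-level representations from a HuBERT-base backbone and attaching a linear CTC head for ASR fine-tuning. The checkpoint name is a public stand-in (the paper's SSA-pretrained weights are not referenced here), and the vocabulary size is hypothetical; treat this as an illustration of the general setup, not the authors' exact pipeline.

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Stand-in checkpoint: a public HuBERT-base, NOT the paper's SSA-pretrained model.
ckpt = "facebook/hubert-base-ls960"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
encoder = HubertModel.from_pretrained(ckpt)

waveform = torch.randn(16000)  # dummy 1 s utterance at 16 kHz
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # (1, n_frames, 768) for a BASE model

# For ASR fine-tuning, a thin CTC head sits on top of the (partially frozen)
# encoder; vocab_size below is a hypothetical character inventory.
vocab_size = 32
ctc_head = torch.nn.Linear(encoder.config.hidden_size, vocab_size)
log_probs = ctc_head(hidden).log_softmax(dim=-1)  # feed to torch.nn.CTCLoss
```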
Related papers
- Scaling HuBERT for African Languages: From Base to Large and XL [0.5825599299113071]
This work introduces SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE-size counterpart. Through a carefully controlled experimental study focused exclusively on Sub-Saharan languages, the authors demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
arXiv Detail & Related papers (2025-11-28T17:17:40Z)
- NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 [24.056321452209666]
This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I). We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies.
arXiv Detail & Related papers (2025-06-16T10:28:27Z)
- Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis [4.774607166378613]
Self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and perform alignment training on a supervised dataset of 40,000 hours.
arXiv Detail & Related papers (2025-05-27T12:50:55Z)
- Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval [49.1574468325115]
We introduce Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10. More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M parameters), remain competitive while being over 13x smaller.
arXiv Detail & Related papers (2025-05-25T23:06:20Z)
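As a quick reference for the metric quoted above, here is a minimal MRR@10 implementation; the function and variable names are illustrative, not taken from the paper's code.

```python
def mrr_at_10(ranked_ids: list[list[str]], relevant_ids: list[set[str]]) -> float:
    """Mean reciprocal rank, counting only hits within the top 10 results."""
    total = 0.0
    for ranking, gold in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in gold:  # first relevant document determines the score
                total += 1.0 / rank
                break
    return total / len(ranked_ids)

# Toy usage: the gold passage of the only query is ranked 2nd -> MRR@10 = 0.5
print(mrr_at_10([["d3", "d7", "d1"]], [{"d7"}]))
```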
- AfriHuBERT: A self-supervised speech representation model for African languages [44.722780475475915]
We present an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages.
While mHuBERT-147's pretraining data covered only 16 African languages, we expand this to 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources.
arXiv Detail & Related papers (2024-09-30T11:28:33Z)
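HuBERT-style continued pretraining optimizes masked prediction of k-means pseudo-labels over the new speech data. The sketch below is a conceptual, simplified rendering of that objective under assumed shapes; real recipes use learned mask embeddings, careful span sampling, and iteratively refreshed cluster targets.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(encoder, frames, pseudo_labels, mask_prob=0.08, span=10):
    """frames: (B, T, D) acoustic features; pseudo_labels: (B, T) k-means cluster ids."""
    B, T, _ = frames.shape
    mask = torch.zeros(B, T, dtype=torch.bool)
    for b in range(B):
        for s in (torch.rand(T) < mask_prob).nonzero().flatten():
            mask[b, int(s) : int(s) + span] = True  # mask a fixed-length span
    masked = frames.masked_fill(mask.unsqueeze(-1), 0.0)  # simplification: zero-out (real: learned embedding)
    logits = encoder(masked)  # (B, T, n_clusters)
    return F.cross_entropy(logits[mask], pseudo_labels[mask])  # loss on masked frames only

# Toy usage with a linear stand-in encoder over 39-dim features and 100 clusters
encoder = torch.nn.Linear(39, 100)
loss = masked_prediction_loss(encoder, torch.randn(2, 50, 39), torch.randint(0, 100, (2, 50)))
```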
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
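For a concrete sense of the curation step named above, here is a toy exact-deduplication pass over a document list; production pipelines such as DCLM's typically add fuzzy, MinHash-style near-duplicate detection on top of this.

```python
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, keyed by a normalized content hash."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

print(dedup_exact(["Hello world.", "hello world.", "Another doc."]))  # 2 docs survive
```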
- Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification [19.893213508284813]
Self-supervised adaptive pre-training (SAPT) is proposed to adapt the pre-trained model to the target domain and languages of the downstream task.
We show that SAPT improves XLSR performance on the FLEURS benchmark, with substantial gains of up to 40.1% for under-represented languages.
arXiv Detail & Related papers (2023-12-12T14:58:08Z)
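The downstream task here is language/dialect identification on top of an XLSR encoder. Below is a hedged sketch of that LID fine-tuning setup using a public XLS-R checkpoint as a stand-in; the label count and checkpoint are illustrative, not the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

ckpt = "facebook/wav2vec2-xls-r-300m"  # public XLS-R checkpoint as a stand-in
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2ForSequenceClassification.from_pretrained(ckpt, num_labels=21)  # e.g. 21 languages

waveform = torch.randn(16000)  # dummy 1 s utterance at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
out = model(**inputs, labels=torch.tensor([3]))  # label 3 = some hypothetical language id
out.loss.backward()  # one training step; wrap in a proper optimizer loop in practice
```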
- LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech [70.3307853082527]
This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies.
It includes documented, large-scale corpora with up to 14,000 hours of heterogeneous speech.
It also includes ten pre-trained SSL wav2vec 2.0 models, ranging from 26 million to one billion learnable parameters, shared with the community.
arXiv Detail & Related papers (2023-09-11T14:13:09Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focused on learnable enhancement of pre-trained features.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Scaling ASR Improves Zero and Few Shot Learning [23.896440724468246]
We propose data selection techniques that efficiently scale training data by finding the most valuable samples in massive datasets.
By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains.
For speakers with speech disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively.
arXiv Detail & Related papers (2021-11-10T21:18:59Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Multilingual Speech Translation with Efficient Finetuning of Pretrained Models [82.22294901727933]
A minimalistic LNA (LayerNorm and Attention) finetuning strategy can achieve zero-shot cross-lingual and cross-modality transfer.
Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model.
arXiv Detail & Related papers (2020-10-24T08:15:08Z)
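The core of LNA finetuning is simply which parameters stay trainable. A minimal sketch, assuming typical transformer parameter naming (the substring patterns below are illustrative and must be adapted to the actual model):

```python
import torch

def apply_lna_finetuning(model: torch.nn.Module) -> None:
    """Freeze everything except LayerNorm and attention parameters."""
    for name, param in model.named_parameters():
        lowered = name.lower()
        param.requires_grad = any(
            key in lowered for key in ("layer_norm", "layernorm", "attn", "attention")
        )

# The optimizer is then built over the surviving parameters only, e.g.:
# optim = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```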
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
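The contrastive task mentioned above asks the model, at each masked frame, to identify the true quantized latent among sampled distractors. A conceptual sketch of that InfoNCE-style loss, with all shapes and names assumed for illustration:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    """context, quantized: (N, D) vectors at masked positions; negatives: (N, K, D)."""
    candidates = torch.cat([quantized.unsqueeze(1), negatives], dim=1)  # (N, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    targets = torch.zeros(len(context), dtype=torch.long)  # true latent sits at index 0
    return F.cross_entropy(sims, targets)

# Toy shapes: 8 masked frames, 100 distractors, 256-dim latents
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 100, 256))
```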