AfriHuBERT: A self-supervised speech representation model for African languages
- URL: http://arxiv.org/abs/2409.20201v2
- Date: Sun, 01 Jun 2025 10:49:58 GMT
- Title: AfriHuBERT: A self-supervised speech representation model for African languages
- Authors: Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi
- Abstract summary: AfriHuBERT is an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. While mHuBERT-147 covered 16 African languages, we expand this to 1,226 through continued pretraining on 10K+ hours of speech data from diverse sources. We evaluate AfriHuBERT on two key speech tasks, Spoken Language Identification (SLID) and Automatic Speech Recognition (ASR). Our results show a +3.6% F1 score improvement for SLID and a 2.1% average Word Error Rate (WER) reduction for ASR over mHuBERT-147.
- Score: 44.722780475475915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present AfriHuBERT, an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. While mHuBERT-147 covered 16 African languages, we expand this to 1,226 through continued pretraining on 10K+ hours of speech data from diverse sources, benefiting an African population of over 600M. We evaluate AfriHuBERT on two key speech tasks, Spoken Language Identification (SLID) and Automatic Speech Recognition (ASR), using the FLEURS benchmark. Our results show a +3.6% F1 score improvement for SLID and a 2.1% average Word Error Rate (WER) reduction for ASR over mHuBERT-147, and demonstrate competitiveness with larger SSL models such as MMS and XEUS. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization and are competitive in extremely low-resource ASR scenarios.
Related papers
- Self-supervised Speech Representations Still Struggle with African American Vernacular English [28.223877889211803]
Underperformance of ASR systems for speakers of marginalized language varieties is a well-documented phenomenon.
We investigate whether the recent wave of Self-Supervised Learning speech models can close the gap in ASR performance between AAVE and Mainstream American English.
arXiv Detail & Related papers (2024-08-26T13:29:25Z) - HYBRINFOX at CheckThat! 2024 -- Task 1: Enhancing Language Models with Structured Information for Check-Worthiness Estimation [0.8083061106940517]
This paper summarizes the experiments and results of the HYBRINFOX team for the CheckThat! 2024 - Task 1 competition.
We propose an approach enriching Language Models such as RoBERTa with embeddings produced by triples.
arXiv Detail & Related papers (2024-07-04T11:33:54Z) - mHuBERT-147: A Compact Multilingual HuBERT Model [23.207762084023933]
mHuBERT-147 is the first general-purpose multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data.
To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method.
Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
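The faiss-based label assignment mentioned above can be sketched conceptually: HuBERT-style pretraining assigns each speech frame a discrete pseudo-label by finding its nearest k-means centroid. The sketch below uses random arrays as stand-ins for real HuBERT features and trained centroids, and plain NumPy in place of faiss, purely for illustration of the assignment step.

```python
import numpy as np

# Hypothetical stand-ins: 1000 frame features of dimension 16,
# and 50 trained k-means cluster centers (k = 50).
rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 16)).astype("float32")
centroids = rng.standard_normal((50, 16)).astype("float32")

# Label assignment: squared distance from every frame to every centroid,
# then nearest centroid per frame becomes its discrete training target.
# (The paper reports doing this step with faiss, 5.2x faster than the
# original implementation; faiss replaces this brute-force computation
# with an optimized index search.)
dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
labels = dists.argmin(axis=1)

print(labels.shape)  # (1000,)
```

In practice the same assignment is obtained from a trained `faiss.Kmeans` object via its index search, which scales this step to the billions of frames involved in multilingual pretraining.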
arXiv Detail & Related papers (2024-06-10T15:32:42Z) - Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context [2.3066058341851816]
We present the first self-supervised multilingual speech model trained exclusively on African speech.
The model learned from nearly 60,000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa.
arXiv Detail & Related papers (2024-04-02T14:43:36Z) - LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech [70.3307853082527]
This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies.
It includes documented, large-scale, heterogeneous corpora with up to 14,000 hours of speech.
It includes ten pre-trained SSL wav2vec 2.0 models, ranging from 26 million to one billion learnable parameters, shared with the community.
arXiv Detail & Related papers (2023-09-11T14:13:09Z) - Indonesian Automatic Speech Recognition with XLSR-53 [0.0]
This study focuses on the development of Indonesian Automatic Speech Recognition (ASR) using the XLSR-53 pre-trained model.
Using the XLSR-53 pre-trained model significantly reduces the amount of training data required for non-English languages.
arXiv Detail & Related papers (2023-08-20T09:59:40Z) - DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed different multilingual XLM-R models with a classification head, trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z) - From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale [48.0390317915984]
XLS-R is a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0.
We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages.
arXiv Detail & Related papers (2021-11-17T18:49:42Z) - AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z) - AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.), to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z) - Speech Recognition for Endangered and Extinct Samoyedic languages [0.32228025627337864]
This study presents experiments on speech recognition with endangered and extinct Samoyedic languages.
To the best of our knowledge, this is the first time a functional ASR system is built for an extinct language.
arXiv Detail & Related papers (2020-12-09T21:41:40Z) - Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z) - Extending Multilingual BERT to Low-Resource Languages [71.0976635999159]
Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning.
We propose a simple but effective approach to extend M-BERT so that it can benefit any new language.
arXiv Detail & Related papers (2020-04-28T16:36:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.