Related papers: AfriHuBERT: A self-supervised speech representation model for African languages

AfriHuBERT: A self-supervised speech representation model for African languages

URL: http://arxiv.org/abs/2409.20201v1
Date: Mon, 30 Sep 2024 11:28:33 GMT
Title: AfriHuBERT: A self-supervised speech representation model for African languages
Authors: Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi,
Abstract summary: We present an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand this to cover 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources.
Score: 44.722780475475915
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we present AfriHuBERT, an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand this to cover 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources, including 23 newly added languages. We evaluate AfriHuBERT on two key speech tasks: Language Identification (LID) and Automatic Speech Recognition (ASR) using FLEURS dataset. Our results show a +4% F1 score improvement on average for LID and a -1.2% average Word Error Rate (WER) reduction for ASR. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization. Additionally, the analysis indicates that the FLEURS have data quality limitations that may affect their suitability for evaluating low-resource African languages, suggesting the need for better evaluation benchmarks for these languages.

Related papers

Self-supervised Speech Representations Still Struggle with African American Vernacular English [28.223877889211803]
Underperformance of ASR systems for speakers of marginalized language varieties is a well-documented phenomenon. We investigate whether or not the recent wave of Self-Supervised Learning speech models can close the gap in ASR performance between AAVE and Mainstream American English.
arXiv Detail & Related papers (2024-08-26T13:29:25Z)
HYBRINFOX at CheckThat! 2024 -- Task 1: Enhancing Language Models with Structured Information for Check-Worthiness Estimation [0.8083061106940517]
This paper summarizes the experiments and results of the HYBRINFOX team for the CheckThat! 2024 - Task 1 competition. We propose an approach enriching Language Models such as RoBERTa with embeddings produced by triples.
arXiv Detail & Related papers (2024-07-04T11:33:54Z)
mHuBERT-147: A Compact Multilingual HuBERT Model [23.207762084023933]
mHuBERT-147 is the first general-purpose multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. Our findings indicate that mHuBERT147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
arXiv Detail & Related papers (2024-06-10T15:32:42Z)
Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context [2.3066058341851816]
We present the first self-supervised multilingual speech model trained exclusively on African speech. The model learned from nearly 60 000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa.
arXiv Detail & Related papers (2024-04-02T14:43:36Z)
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech [70.3307853082527]
This work introduces LeBenchmark 2.0 an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 hours of heterogeneous speech. It includes ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community.
arXiv Detail & Related papers (2023-09-11T14:13:09Z)
Indonesian Automatic Speech Recognition with XLSR-53 [0.0]
This study focuses on the development of Indonesian Automatic Speech Recognition (ASR) using the XLSR-53 pre-trained model. The use of this XLSR-53 pre-trained model is to significantly reduce the amount of training data in non-English languages.
arXiv Detail & Related papers (2023-08-20T09:59:40Z)
DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese. The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages. We present our solution to the shared task, where we employed different multilingual XLM-R models with classification head trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale [48.0390317915984]
XLS-R is a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages.
arXiv Detail & Related papers (2021-11-17T18:49:42Z)
AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.), to 10 indigenous languages of the Americas. We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches. We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
Speech Recognition for Endangered and Extinct Samoyedic languages [0.32228025627337864]
This study presents experiments on speech recognition with endangered and extinct Samoyedic languages. To best of our knowledge, this is the first time a functional ASR system is built for an extinct language.
arXiv Detail & Related papers (2020-12-09T21:41:40Z)
Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations. Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
Extending Multilingual BERT to Low-Resource Languages [71.0976635999159]
M-BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning. We propose a simple but effective approach to extend M-BERT so that it can benefit any new language.
arXiv Detail & Related papers (2020-04-28T16:36:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.