Scaling HuBERT for African Languages: From Base to Large and XL
- URL: http://arxiv.org/abs/2511.23370v1
- Date: Fri, 28 Nov 2025 17:17:40 GMT
- Title: Scaling HuBERT for African Languages: From Base to Large and XL
- Authors: Antoine Caubrière, Elodie Gauthier
- Abstract summary: This work introduces SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE-size counterpart. Through a carefully controlled experimental study focused exclusively on Sub-Saharan languages, the authors demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
- Score: 0.5825599299113071
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see https://huggingface.co/collections/Orange/african-speech-foundation-models. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
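The abstract points to the released open weights but does not spell out how such encoders are consumed downstream. As a minimal sketch of the input/output contract of a HuBERT-style encoder (16 kHz waveform in, frame-level features out) using the Hugging Face `transformers` API: the exact SSA-HuBERT repository IDs are not given here, so this example instantiates a tiny randomly initialised `HubertModel` rather than loading a released checkpoint.

```python
import torch
from transformers import HubertConfig, HubertModel

# Tiny config purely for illustration; the released models use the standard
# HuBERT sizes (roughly 768-dim/12-layer Base, 1024-dim/24-layer Large,
# 1280-dim/48-layer XL).
config = HubertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(32,) * 7,  # keep the 7-layer convolutional front end, just narrower
)
model = HubertModel(config).eval()

# One second of fake 16 kHz audio; real inputs would be normalized waveforms.
waveform = torch.randn(1, 16000)
with torch.no_grad():
    features = model(waveform).last_hidden_state

# Shape is (batch, frames, hidden_size); the conv front end downsamples
# 16 kHz audio by a factor of 320, giving ~49 frames per second.
print(features.shape)
```

In a real pipeline the same `last_hidden_state` call applies after loading a released checkpoint with `HubertModel.from_pretrained(...)` and the matching feature extractor, with the frame-level features then feeding an ASR or LID head.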
Related papers
- AfroScope: A Framework for Studying the Linguistic Landscape of Africa [27.262469904340836]
We introduce AfroScope, a unified framework for African LID, including AfroScope-Data and AfroScope-Models. We propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. We analyze cross-linguistic transfer and domain effects, offering guidance for building robust African LID systems.
arXiv Detail & Related papers (2026-01-19T19:30:35Z)
- AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR [2.6822781046552824]
AfriSpeech-MultiBench is the first domain-specific evaluation suite for over 100 African English accents across 10+ countries. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue, while proprietary models deliver high accuracy on clean speech but vary significantly by country and domain.
arXiv Detail & Related papers (2025-11-18T08:44:17Z)
- Speech Language Models for Under-Represented Languages: Insights from Wolof [9.14632796153174]
We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa. We first emphasize the importance of collecting large-scale, spontaneous, high-quality unsupervised speech data. We show that continued pretraining of HuBERT on this dataset outperforms both the base model and African-centric models on ASR.
arXiv Detail & Related papers (2025-09-18T19:01:48Z) - Hello Afrika: Speech Commands in Kinyarwanda [0.0]
There is a dearth of speech command models for African languages. Hello Afrika aims to address this issue, and its first iteration is focused on the Kinyarwanda language. The model was built from a custom speech command corpus made up of general directives, numbers, and a wake word.
arXiv Detail & Related papers (2025-06-16T16:30:19Z) - Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z) - Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - AfroBench: How Good are Large Language Models on African Languages? [55.35674466745322]
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z) - On decoder-only architecture for speech-to-text and large language model
integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.