Hello Afrika: Speech Commands in Kinyarwanda
- URL: http://arxiv.org/abs/2507.01024v1
- Date: Mon, 16 Jun 2025 16:30:19 GMT
- Title: Hello Afrika: Speech Commands in Kinyarwanda
- Authors: George Igwegbe, Martins Awojide, Mboh Bless, Nirel Kadzo,
- Abstract summary: There is a dearth of speech command models for African languages. Hello Afrika aims to address this issue, and its first iteration focuses on the Kinyarwanda language. The model was built on a custom speech command corpus made up of general directives, numbers, and a wake word.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice or speech commands are a subset of the broader spoken-word corpus of a language. They are essential for the non-contact control and activation of larger AI systems in everyday devices, especially for persons with disabilities. Currently, there is a dearth of speech command models for African languages. The Hello Afrika project aims to address this issue, and its first iteration focuses on the Kinyarwanda language, since the country has shown interest in developing speech recognition technologies, culminating in one of the largest datasets on Mozilla Common Voice. The model was built on a custom speech command corpus made up of general directives, numbers, and a wake word. The final model was deployed on multiple devices (PC, mobile phone, and edge devices) and its performance was assessed using suitable metrics.
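The paper does not publish its inference code, but the deployment it describes (mapping an utterance to one of a small set of commands, with a wake word) can be sketched minimally. The command list, logit values, and confidence threshold below are illustrative assumptions, not taken from the paper; a real deployment would feed audio features through the trained model to obtain the logits.

```python
# Hypothetical sketch: choosing a Kinyarwanda command from model output logits.
# The command list and threshold are illustrative, not from the paper.
import numpy as np

COMMANDS = ["muraho", "yego", "oya", "rimwe", "kabiri"]  # example words only

def classify(logits, threshold=0.6):
    """Softmax over logits; return the command if confident, else None (reject)."""
    exp = np.exp(logits - np.max(logits))  # stable softmax
    probs = exp / exp.sum()
    idx = int(np.argmax(probs))
    return COMMANDS[idx] if probs[idx] >= threshold else None
```

The rejection threshold matters on edge devices: out-of-vocabulary speech should produce `None` rather than the least-unlikely command.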
Related papers
- Scaling HuBERT for African Languages: From Base to Large and XL [0.5825599299113071]
This work introduces SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE-size counterpart. Through a carefully controlled experimental study focused exclusively on Sub-Saharan languages, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
arXiv Detail & Related papers (2025-11-28T17:17:40Z) - AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR [2.6822781046552824]
AfriSpeech-MultiBench is the first domain-specific evaluation suite for over 100 African English accents across 10+ countries. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue, while proprietary models deliver high accuracy on clean speech but vary significantly by country and domain.
arXiv Detail & Related papers (2025-11-18T08:44:17Z) - Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
SpeechSSM learns from and samples long-form spoken audio in a single decoding session without text intermediates. Contributions include new embedding-based and LLM-judged metrics, quality measurements over length and time, and a new benchmark for long-form speech processing and generation, LibriSpeech-Long.
arXiv Detail & Related papers (2024-12-24T18:56:46Z) - Luganda Speech Intent Recognition for IoT Applications [0.3374875022248865]
This research project aimed to develop a Luganda speech intent classification system for IoT applications.
The project uses hardware components such as Raspberry Pi, Wio Terminal, and ESP32 nodes as microcontrollers.
The ultimate objective of this work was to enable voice control using Luganda, which was accomplished through a natural language processing (NLP) model deployed on the Raspberry Pi.
arXiv Detail & Related papers (2024-05-16T10:14:00Z) - Direct Punjabi to English speech translation using discrete units [4.883313216485195]
We present a direct speech-to-speech translation model from Punjabi, an Indic language, to English.
We also explore the performance of using a discrete representation of speech called discrete acoustic units as input to the Transformer-based translation model.
Our results show that the U2UT model outperforms the Speech-to-Unit Translation (S2UT) model by 3.69 BLEU points.
arXiv Detail & Related papers (2024-02-25T03:03:34Z) - AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies [0.0]
We present AnnoTheia, a semi-automatic annotation toolkit that detects when a person speaks on the scene and provides the corresponding transcription.
To show the complete process of preparing AnnoTheia for a language of interest, we also describe the adaptation of a pre-trained model for active speaker detection to Spanish.
arXiv Detail & Related papers (2024-02-20T17:07:08Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
Its main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - Plug-and-Play Multilingual Few-shot Spoken Words Recognition [3.591566487849146]
We propose PLiX, a multilingual and plug-and-play keyword spotting system.
Our few-shot deep models are learned with millions of one-second audio clips across 20 languages.
We show that PLiX can generalize to novel spoken words given as few as just one support example.
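The one-support-example setting PLiX describes can be illustrated with a minimal nearest-neighbor matcher over audio embeddings. The embedding dimensionality, keyword names, and similarity threshold below are illustrative assumptions, not PLiX's actual method or API; in practice the embeddings would come from the pretrained multilingual encoder.

```python
# Hypothetical sketch of one-shot keyword matching: each keyword is represented
# by a single support embedding, and a query is assigned to the most similar one.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_keyword(query_emb, support, threshold=0.5):
    """Return the keyword whose support embedding is most similar to the query,
    or None if no similarity clears the threshold (open-set rejection)."""
    best_word, best_sim = None, threshold
    for word, emb in support.items():
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```

Because classes are defined only by their support embeddings, adding a novel spoken word requires no retraining: a new entry in the `support` dictionary is enough.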
arXiv Detail & Related papers (2023-05-03T18:58:14Z) - Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z) - Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users [3.3946853660795884]
In many countries, illiterate people tend to speak only low-resource languages.
We investigate the effectiveness of unsupervised speech representation learning on noisy radio broadcasting archives.
Our contributions offer a path forward for ethical AI research to serve the needs of those most disadvantaged by the digital divide.
arXiv Detail & Related papers (2021-04-27T10:09:34Z) - Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic and Romance, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z) - Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method doesn't need hand-crafted features and is more robust to noise compared to recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.