Fast Development of ASR in African Languages using Self Supervised Speech Representation Learning
- URL: http://arxiv.org/abs/2103.08993v1
- Date: Tue, 16 Mar 2021 11:37:03 GMT
- Title: Fast Development of ASR in African Languages using Self Supervised Speech Representation Learning
- Authors: Jama Hussein Mohamud, Lloyd Acquaye Thompson, Aissatou Ndoye, and Laurent Besacier
- Abstract summary: This paper describes the results of an informal collaboration launched during the African Master of Machine Intelligence (AMMI) in June 2020.
After a series of lectures and labs on speech data collection using mobile applications, a small group of students and the lecturer continued working on an automatic speech recognition (ASR) project for three languages: Wolof, Ga, and Somali.
This paper describes how the data was collected and how ASR systems were developed with a small amount (1h) of transcribed speech as training data.
- Score: 13.7466513616362
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes the results of an informal collaboration launched during the African Master of Machine Intelligence (AMMI) in June 2020. After a series of lectures and labs on speech data collection using mobile applications and on self-supervised representation learning from speech, a small group of students and the lecturer continued working on an automatic speech recognition (ASR) project for three languages: Wolof, Ga, and Somali. This paper describes how the data was collected and how ASR systems were developed with a small amount (1h) of transcribed speech as training data. In these low-resource conditions, pre-training a model on large amounts of raw speech was fundamental to the effectiveness of the ASR systems developed.
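The recipe the abstract outlines, self-supervised pre-training on raw speech followed by fine-tuning on roughly one hour of transcribed data, can be sketched as follows. This is a minimal illustration using the Hugging Face transformers library with a wav2vec 2.0-style checkpoint; the checkpoint name, character vocabulary, and hyperparameters are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch: fine-tune a self-supervised speech model with a CTC head
# on a small transcribed set. Checkpoint, vocabulary, and learning rate are
# illustrative assumptions, not the paper's exact configuration.
import json

import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
)

# 1) Character vocabulary built from the ~1h of transcripts (placeholder set;
#    a real run would derive it from the Wolof, Ga, or Somali text).
vocab = {"<pad>": 0, "<unk>": 1, "|": 2}  # "|" replaces the space between words
vocab.update({c: i + 3 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz'")})
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0, do_normalize=True
)

# 2) Load weights pre-trained on raw (untranscribed) speech only; the CTC
#    output layer on top is newly initialized for our character set.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",  # assumed checkpoint, not the authors' exact one
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice when fine-tuning on little data

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(waveform, transcript):
    """One CTC fine-tuning step on a single (16 kHz audio, text) pair."""
    inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_values=inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In a real run, transcripts would be normalized to the chosen character set and batches padded with a data collator; the point of the sketch is that only the small CTC head and fine-tuning data are language-specific, while the pre-trained encoder carries the bulk of the model.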
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Speech Recognition Rescoring with Large Speech-Text Foundation Models [20.145389016219106]
Large language models (LLMs) have demonstrated the ability to understand human language by leveraging large amounts of text data.
Automatic speech recognition (ASR) systems are often limited by available transcribed speech data.
Recent multi-modal large language models have demonstrated strong spoken language understanding.
arXiv Detail & Related papers (2024-09-25T06:17:23Z)
- Error-preserving Automatic Speech Recognition of Young English Learners' Language [6.491559928368298]
One of the central skills that language learners need to practice is speaking the language.
Recent advances in speech technology and natural language processing allow for the creation of novel tools for language learners to practice their speaking skills.
We build an ASR system that works on spontaneous speech by young language learners and preserves their errors.
arXiv Detail & Related papers (2024-06-05T13:15:37Z)
- Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks.
The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments.
We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili [0.0]
It has emerged that several African indigenous languages, including Kiswahili, are technologically under-resourced.
This paper explores the transcription process and the development of a Kiswahili speech corpus.
It provides an updated Kiswahili phoneme dictionary for the ASR model that was created using the CMU Sphinx speech recognition toolbox.
arXiv Detail & Related papers (2022-10-29T09:04:09Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction (a minimal usage sketch follows at the end of this list).
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition [31.808145263757105]
We use CycleGAN-based non-parallel voice conversion technology to forge labeled training data that is close to the test speaker's speech.
We evaluate this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
arXiv Detail & Related papers (2020-05-19T07:35:14Z)
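For the WavLM entry above, here is a minimal, hedged sketch of extracting frame-level speech representations from a public WavLM checkpoint with the Hugging Face transformers library; the checkpoint name and the random stand-in waveform are illustrative assumptions.

```python
# Minimal sketch: frame-level representations from a pre-trained WavLM model.
# "microsoft/wavlm-base" is a public checkpoint; the random waveform below
# merely stands in for 1 second of real 16 kHz audio.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
model = WavLMModel.from_pretrained("microsoft/wavlm-base")
model.eval()

waveform = torch.randn(16_000).numpy()  # placeholder for 1 s of 16 kHz speech
inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Shape (batch, frames, hidden): one feature vector per ~20 ms of audio.
print(outputs.last_hidden_state.shape)
```

A downstream system would typically put a lightweight task head (e.g., CTC for ASR, or a speaker classifier) on top of these features, analogous to the fine-tuning sketch after the abstract above.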