KinSPEAK: Improving speech recognition for Kinyarwanda via
semi-supervised learning methods
- URL: http://arxiv.org/abs/2308.11863v3
- Date: Sat, 2 Mar 2024 07:14:02 GMT
- Title: KinSPEAK: Improving speech recognition for Kinyarwanda via
semi-supervised learning methods
- Authors: Antoine Nzeyimana
- Abstract summary: We show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning, and using semi-supervised learning significantly improve speech recognition performance for Kinyarwanda.
Our model achieves a 3.2% word error rate (WER) on a new dataset and 15.6% WER on the Mozilla Common Voice benchmark.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite recent availability of large transcribed Kinyarwanda speech data,
achieving robust speech recognition for Kinyarwanda is still challenging. In
this work, we show that using self-supervised pre-training, following a simple
curriculum schedule during fine-tuning and using semi-supervised learning to
leverage large unlabelled speech data significantly improve speech recognition
performance for Kinyarwanda. Our approach focuses on using public domain data
only. A new studio-quality speech dataset is collected from a public website,
then used to train a clean baseline model. The clean baseline model is then
used to rank examples from a more diverse and noisy public dataset, defining a
simple curriculum training schedule. Finally, we apply semi-supervised learning
to label and learn from large unlabelled data in five successive generations.
Our final model achieves a 3.2% word error rate (WER) on the new dataset and a
15.6% WER on the Mozilla Common Voice benchmark, which is state-of-the-art to the
best of our knowledge. Our experiments also indicate that using syllabic rather
than character-based tokenization results in better speech recognition
performance for Kinyarwanda.
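To make the recipe concrete, here is a minimal sketch of the three ideas in the abstract: ranking a noisier labelled set with the clean baseline to define an easy-first curriculum, running several generations of semi-supervised pseudo-labelling over unlabelled audio, and tokenizing text into syllables rather than characters. All names (Utterance, curriculum_order, self_training, syllabic_tokens) and the concrete heuristics are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- helper names and heuristics are assumptions,
# not the KinSPEAK implementation.
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    audio_path: str
    text: str = ""  # empty string for unlabelled speech


def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance normalised by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(r)][len(h)] / max(len(r), 1)


def curriculum_order(transcribe: Callable[[str], str],
                     noisy: List[Utterance]) -> List[Utterance]:
    """Rank noisy labelled examples easiest-first by the clean baseline's WER,
    giving a simple curriculum schedule for fine-tuning."""
    scored = [(word_error_rate(u.text, transcribe(u.audio_path)), u) for u in noisy]
    return [u for _, u in sorted(scored, key=lambda s: s[0])]


def self_training(transcribe: Callable[[str], str],
                  fine_tune: Callable[..., Callable[[str], str]],
                  unlabelled: List[Utterance],
                  generations: int = 5) -> Callable[[str], str]:
    """Label the unlabelled audio with the current model and fine-tune on the
    pseudo-labels, repeating for a fixed number of generations (five in the
    paper). A confidence filter would normally precede each fine-tuning pass."""
    for _ in range(generations):
        pseudo = [Utterance(u.audio_path, transcribe(u.audio_path)) for u in unlabelled]
        transcribe = fine_tune(transcribe, pseudo)
    return transcribe


def syllabic_tokens(word: str) -> List[str]:
    """Very rough syllable splitter: Kinyarwanda syllables are predominantly
    open (consonant cluster followed by a vowel), so split on maximal
    non-vowel runs plus the following vowel(s)."""
    return re.findall(r"[^aeiou]*[aeiou]+|[^aeiou]+$", word.lower())
```

The actual system operates on acoustic models rather than opaque transcribe callables, but the control flow, rank with a clean baseline and then iterate pseudo-labelling, is what the abstract describes.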
Related papers
- Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text [22.19230427358921]
It is worth researching how to improve the performance of Whisper on under-represented languages.
We utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
arXiv Detail & Related papers (2024-08-10T13:39:13Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed RAVEn (Raw Audio-Visual Speech Encoders) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models trained on more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Device Directedness with Contextual Cues for Spoken Dialog Systems [15.96415881820669]
We define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins.
We use low-level speech representations from a self-supervised representation learning model for our downstream classification task.
We propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training.
arXiv Detail & Related papers (2022-11-23T19:49:11Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST), which translates speech from one language into another.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Simple and Effective Zero-shot Cross-lingual Phoneme Recognition [46.76787843369816]
This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages.
Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures.
arXiv Detail & Related papers (2021-09-23T22:50:32Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
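The XLSR and zero-shot phoneme-recognition entries above fine-tune a multilingually pre-trained wav2vec 2.0 model for new languages by attaching a CTC head. As a hedged sketch, not tied to any of the papers listed and not necessarily KinSPEAK's setup, the snippet below shows how such a checkpoint can be loaded with a fresh CTC head over a custom (for example syllabic) vocabulary using the Hugging Face transformers library; the checkpoint name, the vocab.json file, and the settings are illustrative assumptions.

```python
# Illustrative only: fine-tuning setup for a multilingual wav2vec 2.0
# checkpoint with a fresh CTC head over a custom vocabulary. The checkpoint
# name and vocab.json are assumptions, not taken from the papers above.
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

# vocab.json maps each output token (characters or syllables) to an integer id.
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained encoder and attach a randomly initialised CTC head
# sized to the new vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # the convolutional feature encoder is usually kept frozen
# Fine-tuning then minimises CTC loss over (audio, transcript) pairs,
# e.g. with transformers.Trainer and a padding data collator.
```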