The evaluation of a code-switched Sepedi-English automatic speech
recognition system
- URL: http://arxiv.org/abs/2403.07947v1
- Date: Mon, 11 Mar 2024 15:11:28 GMT
- Title: The evaluation of a code-switched Sepedi-English automatic speech
recognition system
- Authors: Amanda Phaladi and Thipe Modipa
- Abstract summary: We present the evaluation of the Sepedi-English code-switched automatic speech recognition system.
This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach.
The model achieved its lowest WER of 41.9%; however, it faced challenges in recognizing Sepedi-only text.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech technology is a field that encompasses various techniques and tools
used to enable machines to interact with speech, such as automatic speech
recognition (ASR), spoken dialog systems, and others, allowing a device to
capture spoken words through a microphone from a human speaker. End-to-end
approaches such as Connectionist Temporal Classification (CTC) and
attention-based methods are the most used for the development of ASR systems.
However, these techniques were commonly used for research and development for
many high-resourced languages with large amounts of speech data for training
and evaluation, leaving low-resource languages relatively underdeveloped. While
the CTC method has been successfully used for other languages, its
effectiveness for the Sepedi language remains uncertain. In this study, we
present the evaluation of the Sepedi-English code-switched automatic speech
recognition system. This end-to-end system was developed using the Sepedi
Prompted Code Switching corpus and the CTC approach. The performance of the
system was evaluated using both the NCHLT Sepedi test corpus and the Sepedi
Prompted Code Switching corpus. The model achieved its lowest WER of 41.9%;
however, it faced challenges in recognizing Sepedi-only text.
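The abstract names the two technical ingredients of the evaluation: CTC-based decoding and word error rate (WER). As a minimal sketch of both, the snippet below shows greedy (best-path) CTC collapsing and a standard edit-distance WER. The blank index, token ids, and the Sepedi-English example sentence are invented for illustration and are not from the paper.

```python
BLANK = 0  # assumed CTC blank index for this sketch

def ctc_greedy_collapse(ids):
    """Best-path CTC rule: collapse repeated ids, then drop blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != BLANK:
            out.append(i)
        prev = i
    return out

def wer(ref_words, hyp_words):
    """Word error rate: word-level edit distance divided by reference length."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # deletions
    for j in range(m + 1):
        d[0][j] = j  # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m] / n

# Frame-level argmax ids -> collapsed label sequence
print(ctc_greedy_collapse([0, 3, 3, 0, 0, 5, 5, 5, 0, 3]))  # → [3, 5, 3]

# Hypothetical code-switched reference vs. hypothesis (one word deleted)
print(wer("ke a leboga everyone".split(), "ke leboga everyone".split()))  # → 0.25
```

A reported WER of 41.9% means roughly 42 word-level errors (substitutions, deletions, insertions) per 100 reference words under this definition.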
Related papers
- Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced
Code-Switching Speech Recognition [5.3545957730615905]
We introduce language identification information into the middle layer of the ASR model's encoder.
We aim to generate acoustic features that imply language distinctions in a more implicit way, reducing the model's confusion when dealing with language switching.
arXiv Detail & Related papers (2023-12-15T07:46:35Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Integrating Knowledge in End-to-End Automatic Speech Recognition for
Mandarin-English Code-Switching [41.88097793717185]
Code-Switching (CS) is a common linguistic phenomenon in multilingual communities.
This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech.
arXiv Detail & Related papers (2021-12-19T17:31:15Z) - Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces
and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach.
We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z) - A review of on-device fully neural end-to-end automatic speech
recognition algorithms [20.469868150587075]
We review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications.
Fully neural network end-to-end speech recognition algorithms have been proposed.
We extensively discuss their structures, performance, and advantages compared to conventional algorithms.
arXiv Detail & Related papers (2020-12-14T22:18:08Z) - Acoustics Based Intent Recognition Using Discovered Phonetic Units for
Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic languages and Romance languages, for two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z) - NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z) - Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.