Indonesian Automatic Speech Recognition with XLSR-53
- URL: http://arxiv.org/abs/2308.11589v1
- Date: Sun, 20 Aug 2023 09:59:40 GMT
- Title: Indonesian Automatic Speech Recognition with XLSR-53
- Authors: Panji Arisaputra, Amalia Zahra
- Abstract summary: This study focuses on the development of Indonesian Automatic Speech Recognition (ASR) using the XLSR-53 pre-trained model.
The XLSR-53 pre-trained model is used to significantly reduce the amount of training data required for non-English languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This study focuses on the development of Indonesian Automatic Speech
Recognition (ASR) using the XLSR-53 pre-trained model, where XLSR stands for
cross-lingual speech representations. The XLSR-53 pre-trained model is used to
significantly reduce the amount of training data in non-English languages
required to achieve a competitive Word Error Rate (WER). The total amount of
data used in this study is 24 hours, 18 minutes, and 1 second: (1) TITML-IDN, 14
hours and 31 minutes; (2) Magic Data, 3 hours and 33 minutes; and (3) Common
Voice, 6 hours, 14 minutes, and 1 second. With a WER of 20%, the model built in
this study can compete with similar models on the Common Voice test split.
Adding a language model decreases the WER by around 8 percentage points, from
20% to 12%. Thus, this study improves on previous research, contributing to a
better Indonesian ASR built with a smaller amount of data.
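To make the reported figures concrete, WER is the word-level edit (Levenshtein) distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal plain-Python sketch of the metric (an illustration, not the evaluation code used in the paper; the Indonesian example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("saya suka makan nasi", "saya suka makan nasi"))  # 0.0
print(wer("saya suka makan nasi", "saya suka minum nasi"))  # 0.25 (1 substitution / 4 words)
```

A WER of 20% therefore means roughly one word error per five reference words; the language-model rescoring reported above cuts that to about one per eight.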
Related papers
- Improving Code-Switching Speech Recognition with TTS Data Augmentation [58.34842693152991]
This paper explores multilingual text-to-speech (TTS) models as an effective data augmentation technique to address this shortage. We fine-tune the multilingual CosyVoice2 TTS model on the SEAME dataset to generate synthetic conversational Chinese-English code-switching speech.
arXiv Detail & Related papers (2026-01-02T10:11:51Z)
- Whispering in Amharic: Fine-tuning Whisper for Low-resource Language [3.2858851789879595]
This work explores fine-tuning OpenAI's Whisper automatic speech recognition model for Amharic.
We fine-tune it using datasets like Mozilla Common Voice, FLEURS, and the BDU-speech dataset.
The best-performing model, Whispersmall-am, improves significantly when fine-tuned on a mix of existing FLEURS data and new, unseen Amharic datasets.
arXiv Detail & Related papers (2025-03-24T09:39:41Z)
- AfriHuBERT: A self-supervised speech representation model for African languages [44.722780475475915]
We present an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages.
While mHuBERT-147 was pretrained on 16 African languages, we expand this to cover 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources.
arXiv Detail & Related papers (2024-09-30T11:28:33Z)
- HYBRINFOX at CheckThat! 2024 -- Task 1: Enhancing Language Models with Structured Information for Check-Worthiness Estimation [0.8083061106940517]
This paper summarizes the experiments and results of the HYBRINFOX team for the CheckThat! 2024 - Task 1 competition.
We propose an approach enriching Language Models such as RoBERTa with embeddings produced by triples.
arXiv Detail & Related papers (2024-07-04T11:33:54Z)
- Automatic Speech Recognition Advancements for Indigenous Languages of the Americas [0.0]
The Second Americas (Americas Natural Language Processing) Competition Track 1 of NeurIPS (Neural Information Processing Systems) 2022 proposed the task of training automatic speech recognition systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana.
We describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods.
We release our best models for each language, marking the first open ASR models for Wa'ikhana and Kotiria.
arXiv Detail & Related papers (2024-04-12T10:12:38Z)
- Convoifilter: A case study of doing cocktail party speech recognition [59.80042864360884]
The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach.
We openly share our pre-trained model at hf.co/nguyenvulebinh/voice-filter to foster further research.
arXiv Detail & Related papers (2023-08-22T12:09:30Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Improving RNN-T ASR Performance with Date-Time and Location Awareness [6.308539010172309]
We show that contextual information, when used individually, improves overall performance by as much as 3.48% relative to the baseline.
On specific domains, these contextual signals show improvements as high as 11.5%, without any significant degradation on others.
Our results indicate that with limited data to train the ASR model, contextual signals can improve the performance significantly.
arXiv Detail & Related papers (2021-06-11T05:57:30Z)
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end work achieved 12.5%, 27.5%, and 23.8% WER, new performance milestones for the MGB2, MGB3, and MGB5 challenges respectively.
Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.