XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese
- URL: http://arxiv.org/abs/2401.06832v1
- Date: Fri, 12 Jan 2024 13:44:48 GMT
- Title: XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese
- Authors: Panji Arisaputra, Alif Tri Handoyo and Amalia Zahra
- Abstract summary: The study aims to improve ASR performance in converting spoken language into written text, specifically for Indonesian, Javanese, and Sundanese languages.
The results show that the XLS-R 300m model achieves competitive Word Error Rate (WER) measurements, with a slight compromise in performance for Javanese and Sundanese languages.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This research paper focuses on the development and evaluation of Automatic
Speech Recognition (ASR) technology using the XLS-R 300m model. The study aims
to improve ASR performance in converting spoken language into written text,
specifically for Indonesian, Javanese, and Sundanese languages. The paper
discusses the testing procedures, datasets used, and methodology employed in
training and evaluating the ASR systems. The results show that the XLS-R 300m
model achieves competitive Word Error Rate (WER) measurements, with a slight
compromise in performance for Javanese and Sundanese languages. The integration
of a 5-gram KenLM language model significantly reduces WER and enhances ASR
accuracy. The research contributes to the advancement of ASR technology by
addressing linguistic diversity and improving performance across various
languages. The findings provide insights into optimizing ASR accuracy and
applicability for diverse linguistic contexts.
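The abstract's central metric, Word Error Rate, can be made concrete. The following is a minimal sketch (not the paper's evaluation code): word-level Levenshtein distance divided by the number of reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("saya makan nasi", "saya makan roti")` counts one substitution out of three reference words, i.e. roughly 0.33; production systems typically use a library implementation rather than this sketch.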
Related papers
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1% on two benchmarks, with word error rates 3.3% and 2.0% lower, respectively.
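The N-best re-ranking idea above can be illustrated minimally: pick the hypothesis that maximizes a weighted combination of the ASR score and an external language-model score. The scoring interface, `lm_weight`, and the toy LM below are assumptions for illustration, not the paper's method.

```python
def rerank_nbest(nbest, lm_score, lm_weight=0.5):
    """Re-rank (text, asr_score) hypotheses by asr_score + lm_weight * lm_score(text).

    nbest: list of (text, asr_score) pairs; lm_score: callable text -> float.
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))

# Toy LM for illustration: reward hypotheses containing an in-vocabulary word.
toy_lm = lambda text: 1.0 if "terima" in text else 0.0

# The ASR model slightly prefers the misspelling; the LM flips the ranking.
best = rerank_nbest([("trima kasih", -1.0), ("terima kasih", -1.2)], toy_lm)
```

Here `best` is `("terima kasih", -1.2)`, since -1.2 + 0.5 * 1.0 = -0.7 beats -1.0.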
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in low-resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
- Romanization Encoding For Multilingual ASR [17.296868524096986]
We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition systems.
By adopting romanization encoding alongside a balanced tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions.
Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system.
arXiv Detail & Related papers (2024-07-05T09:13:24Z)
- Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques [17.166092544686553]
This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora.
We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript.
arXiv Detail & Related papers (2024-06-12T15:59:25Z)
- Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks.
The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments.
We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English.
For so-called low-resource languages, methods for creating new resources from existing ones are being investigated.
We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
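One common instance of such data augmentation is speed perturbation, which resamples the waveform to simulate faster or slower speech. This sketch illustrates the general technique, not necessarily this paper's exact approach.

```python
import numpy as np

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Change playback speed by resampling with linear interpolation.

    factor > 1 shortens the signal (faster speech), factor < 1 lengthens it.
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal at which to sample the output.
    old_idx = np.linspace(0, len(waveform) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(waveform)), waveform)
```

Typical recipes augment each utterance at factors such as 0.9, 1.0, and 1.1, effectively tripling the training data; pitch-preserving variants use dedicated DSP libraries instead of plain resampling.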
arXiv Detail & Related papers (2022-07-14T12:49:15Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing.
Recent works also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.