ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana
- URL: http://arxiv.org/abs/2404.08368v1
- Date: Fri, 12 Apr 2024 10:12:38 GMT
- Title: ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana
- Authors: Monica Romero, Sandra Gomez, Iván G. Torre,
- Abstract summary: We propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources.
We show that the number of freeze fine-tuning updates and the dropout rate are more vital hyperparameters than the total number of epochs or the learning rate (lr).
We release our best models -- the first ASR models reported to date for Wa'ikhana and Kotiria.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities of the Americas. Track 1 of the Second AmericasNLP Competition at NeurIPS 2022 proposed developing automatic speech recognition (ASR) systems for five indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources and applying data augmentation methods, which resulted in the winning approach in this competition. To achieve this, we systematically investigated the impact of different hyperparameters on the performance of the language models via Bayesian search, specifically focusing on two variants of the Wav2vec2.0 XLS-R model: 300M and 1B parameters. Moreover, we performed a global sensitivity analysis to assess the contribution of various hyperparameter configurations to the performance of our best models. Importantly, our results show that the number of freeze fine-tuning updates and the dropout rate are more vital hyperparameters than the total number of epochs or the learning rate (lr). Additionally, we release our best models -- the first ASR models reported for Wa'ikhana and Kotiria -- and the many experiments performed, to pave the way for other researchers to continue improving ASR in minority languages. This insight opens up interesting avenues for future work, allowing for the advancement of ASR techniques in the preservation of minority indigenous languages, while acknowledging the complexities involved in this important endeavour.
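The abstract describes a Bayesian search over fine-tuning hyperparameters, with freeze fine-tuning updates and dropout emerging as the most influential. A minimal, library-free sketch of such a search is shown below, with plain random search standing in for the Bayesian optimiser and an entirely synthetic objective replacing a real fine-tuning run; the search ranges and the shape of `mock_objective` are illustrative assumptions, not values from the paper.

```python
import random

# Hypothetical search space echoing the hyperparameters the paper
# highlights: freeze fine-tuning updates, dropout, epochs, learning rate.
SEARCH_SPACE = {
    "freeze_steps": (0, 10_000),   # feature-encoder freeze updates
    "dropout":      (0.0, 0.3),
    "epochs":       (5, 50),
    "lr":           (1e-5, 1e-3),
}

def sample_config(rng):
    """Draw one hyperparameter configuration from the space."""
    return {
        "freeze_steps": rng.randint(*SEARCH_SPACE["freeze_steps"]),
        "dropout":      rng.uniform(*SEARCH_SPACE["dropout"]),
        "epochs":       rng.randint(*SEARCH_SPACE["epochs"]),
        "lr":           10 ** rng.uniform(-5, -3),  # log-uniform over the lr range
    }

def mock_objective(cfg):
    """Stand-in for a real fine-tuning run returning a dev-set WER.
    Deliberately shaped so freeze_steps and dropout dominate, mirroring
    the paper's sensitivity finding; the function itself is synthetic."""
    return (abs(cfg["freeze_steps"] - 4_000) / 4_000   # strong effect
            + abs(cfg["dropout"] - 0.1) * 5            # strong effect
            + abs(cfg["epochs"] - 30) / 300            # weak effect
            + abs(cfg["lr"] - 1e-4) * 100)             # weak effect

def random_search(trials=200, seed=0):
    """Simplified stand-in for Bayesian optimisation: sample configs
    and keep the one with the lowest (mock) WER."""
    rng = random.Random(seed)
    return min((sample_config(rng) for _ in range(trials)),
               key=mock_objective)
```

A real version would replace `mock_objective` with a fine-tuning run of Wav2vec2.0 XLS-R evaluated on a development set, and the random sampler with a Bayesian optimisation library.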
Related papers
- XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese [0.0]
The study aims to improve ASR performance in converting spoken language into written text, specifically for Indonesian, Javanese, and Sundanese languages.
The results show that the XLS-R 300m model achieves competitive Word Error Rate (WER) measurements, with a slight compromise in performance for Javanese and Sundanese languages.
arXiv Detail & Related papers (2024-01-12T13:44:48Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Evaluating Self-Supervised Speech Representations for Indigenous American Languages [6.235388047623929]
We present an ASR corpus for Quechua, an indigenous South American Language.
We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR.
Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.
arXiv Detail & Related papers (2023-10-05T16:11:14Z)
- End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English.
For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated.
We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
arXiv Detail & Related papers (2022-07-14T12:49:15Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English [5.094176584161206]
We study training a single end-to-end (E2E) automatic speech recognition (ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and English.
We first describe the development of multilingual E2E ASR based on Transformer networks and then perform an extensive assessment on the aforementioned languages.
arXiv Detail & Related papers (2021-08-03T04:04:01Z)
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end system achieved WERs of 12.5%, 27.5%, and 23.8%, a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance on Arabic is still considerably better than that of the machine, with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z)
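The WER figures quoted throughout this list are computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the reference length. A minimal sketch (the example strings are illustrative, not from any of the corpora above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens via dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the cat sat on")` is 1/3: one inserted word against a three-word reference. Production pipelines typically use a tested library for this rather than a hand-rolled function.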
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.