Speech Recognition for Endangered and Extinct Samoyedic languages
- URL: http://arxiv.org/abs/2012.05331v1
- Date: Wed, 9 Dec 2020 21:41:40 GMT
- Title: Speech Recognition for Endangered and Extinct Samoyedic languages
- Authors: Niko Partanen, Mika Hämäläinen, Tiina Klooster
- Abstract summary: This study presents experiments on speech recognition with endangered and extinct Samoyedic languages.
To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language.
- Score: 0.32228025627337864
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Our study presents a series of experiments on speech recognition with
endangered and extinct Samoyedic languages, spoken in Northern and Southern
Siberia. To the best of our knowledge, this is the first time a functional ASR
system has been built for an extinct language. With the Kamas language we achieve
a Label Error Rate of 15%, and conclude through careful error analysis that this
quality is already very useful as a starting point for refined human
transcriptions. Our results with the related Nganasan language are more modest,
with the best model having an error rate of 33%. We show, however, through
experiments where Kamas training data is enlarged incrementally, that Nganasan
results are in line with what is expected under low-resource circumstances of
the language. Based on this, we provide recommendations for scenarios in which
further language documentation or archive processing activities could benefit
from modern ASR technology. All training data and processing scripts have been
published on Zenodo with clear licences to ensure further work on this
important topic.
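The Label Error Rates quoted in the abstract (15% for Kamas, 33% for Nganasan) are conventionally computed as the edit distance between the predicted and reference label sequences, normalized by the reference length. The snippet below is a minimal Python sketch of that computation; it is not the authors' released Zenodo scripts, and the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (lists of symbols)."""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete everything in ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert everything in hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)]

def label_error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference sequence."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Hypothetical character-level example, not taken from the Kamas or Nganasan data.
reference = list("this is a test")
hypothesis = list("this is a text")
print(f"LER = {label_error_rate(reference, hypothesis):.2%}")
```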
Related papers
- Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition [4.702636570667311]
The SENCOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts. We propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech system. An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data.
arXiv Detail & Related papers (2025-07-14T21:44:35Z)
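The SENCOTEN entry above mentions folding an n-gram language model into the recognizer via shallow fusion or n-best rescoring. As a rough illustration of the rescoring idea only (not that paper's implementation), the sketch below re-ranks a made-up n-best list by adding a weighted language-model log-probability to each hypothesis's ASR score; `toy_lm_logprob` is a stand-in for a real n-gram model, and all scores are hypothetical.

```python
import math

def rescore_nbest(nbest, lm_logprob, lm_weight=0.5, length_penalty=0.0):
    """Re-rank an n-best list by adding a weighted LM score to the ASR score.

    nbest: list of (hypothesis_text, asr_logprob) pairs from the recognizer.
    lm_logprob: function mapping a hypothesis string to its LM log-probability.
    """
    rescored = []
    for text, asr_score in nbest:
        combined = (asr_score
                    + lm_weight * lm_logprob(text)
                    + length_penalty * len(text.split()))
        rescored.append((text, combined))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in for an n-gram LM: slightly favors hypotheses containing "the".
def toy_lm_logprob(text):
    words = text.split()
    return math.log(0.5) * len(words) + (1.0 if "the" in words else 0.0)

nbest = [("recognize the speech", -12.3), ("wreck a nice beach", -12.1)]
best_text, best_score = rescore_nbest(nbest, toy_lm_logprob)[0]
print(best_text, round(best_score, 2))
```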
- Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance [9.624005980086707]
State-of-the-art methods deploy self-supervised transfer learning where a model pre-trained on large amounts of data is fine-tuned using little labeled data.
We show that Frisian ASR performance can be improved by using multilingual fine-tuning data and an auxiliary language identification task.
arXiv Detail & Related papers (2025-02-07T12:42:46Z)
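The Frisian entry above combines multilingual fine-tuning with an auxiliary language identification task. One generic way to set up such multi-task training, sketched below under the assumption of a CTC-style encoder (this is not that paper's actual model, data, or hyperparameters), is to attach a small language-ID classifier to pooled encoder states and add its cross-entropy loss to the CTC loss with a weighting factor.

```python
import torch
import torch.nn as nn

class EncoderWithAuxLID(nn.Module):
    """Toy acoustic encoder with a CTC output head and an auxiliary language-ID head."""

    def __init__(self, feat_dim=80, hidden=256, vocab_size=40, num_languages=3):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab_size)     # per-frame label logits
        self.lid_head = nn.Linear(hidden, num_languages)  # utterance-level logits

    def forward(self, feats):
        states, _ = self.encoder(feats)                    # (B, T, H)
        ctc_logits = self.ctc_head(states).log_softmax(-1) # (B, T, V)
        lid_logits = self.lid_head(states.mean(dim=1))     # pool over time -> (B, L)
        return ctc_logits, lid_logits

def multitask_loss(model, feats, targets, target_lens, feat_lens, lang_ids, alpha=0.3):
    """Combined loss: CTC for transcription plus weighted cross-entropy for LID."""
    ctc_logits, lid_logits = model(feats)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        ctc_logits.transpose(0, 1), targets, feat_lens, target_lens)
    lid = nn.CrossEntropyLoss()(lid_logits, lang_ids)
    return ctc + alpha * lid

# Hypothetical batch: 2 utterances, 50 feature frames each, targets of length 5.
model = EncoderWithAuxLID()
feats = torch.randn(2, 50, 80)
targets = torch.randint(1, 40, (2, 5))
loss = multitask_loss(model, feats, targets,
                      target_lens=torch.tensor([5, 5]),
                      feat_lens=torch.tensor([50, 50]),
                      lang_ids=torch.tensor([0, 2]))
loss.backward()
```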
- Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa [2.5055584842618175]
Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text.
This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa.
arXiv Detail & Related papers (2025-01-19T11:52:46Z)
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit relationships between languages to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text [22.19230427358921]
It is worth researching how to improve the performance of Whisper on under-represented languages.
We utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
arXiv Detail & Related papers (2024-08-10T13:39:13Z) - Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information [8.099700053397278]
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador.
The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies.
The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small size.
arXiv Detail & Related papers (2024-04-23T20:26:07Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- A Novel Self-training Approach for Low-resource Speech Recognition [15.612232220719653]
We propose a self-training approach for automatic speech recognition (ASR) for low-resource settings.
Our approach significantly reduces the word error rate, achieving a relative improvement of 14.94%.
Our proposed approach reports the best results on the Common Voice Punjabi dataset.
arXiv Detail & Related papers (2023-08-10T01:02:45Z)
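The self-training entry above reports a 14.94% relative word-error-rate improvement. The sketch below shows the arithmetic behind a relative WER reduction and a generic confidence-filtered pseudo-labeling step of the kind self-training pipelines typically use; the helper `transcribe_with_confidence` and all numbers are hypothetical placeholders, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer

def select_pseudo_labels(utterances, transcribe_with_confidence, threshold=0.9):
    """Keep only automatic transcripts the seed model is confident about.

    `utterances` is any iterable of audio items; `transcribe_with_confidence`
    is a hypothetical callable returning (text, confidence) for one utterance.
    """
    selected = []
    for audio in utterances:
        text, confidence = transcribe_with_confidence(audio)
        if confidence >= threshold:
            selected.append((audio, text))  # added to training data for the next round
    return selected

# Hypothetical WER values only, not results from the paper.
print(f"{relative_wer_reduction(0.30, 0.2552):.2%} relative WER reduction")
```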
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low- and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.