Related papers: End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach

URL: http://arxiv.org/abs/2410.16330v1
Date: Sat, 19 Oct 2024 11:46:30 GMT
Title: End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach
Authors: Abdulhady Abas Abdullah, Shima Tabibian, Hadi Veisi, Aso Mahmudi, Tarik Rashid
Abstract summary: This paper introduces a study exploring the effectiveness of Whisper, a pre-trained ASR model, for Northern Kurdish (Kurmanji), an under-resourced language spoken in the Middle East. Using a Northern Kurdish fine-tuning speech corpus containing approximately 68 hours of validated transcribed data, our experiments demonstrate that the additional module fine-tuning strategy significantly improves ASR accuracy.
Score: 1.3689715712707342
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic Speech Recognition (ASR) for low-resource languages remains a challenging task due to limited training data. This paper introduces a comprehensive study exploring the effectiveness of Whisper, a pre-trained ASR model, for Northern Kurdish (Kurmanji), an under-resourced language spoken in the Middle East. We investigate three fine-tuning strategies: vanilla, specific parameters, and additional modules. Using a Northern Kurdish fine-tuning speech corpus containing approximately 68 hours of validated transcribed data, our experiments demonstrate that the additional module fine-tuning strategy significantly improves ASR accuracy on a specialized test set, achieving a Word Error Rate (WER) of 10.5% and Character Error Rate (CER) of 5.7% with Whisper version 3. These results underscore the potential of sophisticated transformer models for low-resource ASR and emphasize the importance of tailored fine-tuning techniques for optimal performance.
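For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how WER and CER are commonly computed: the edit distance between a reference transcript and the recognizer output, divided by the reference length. It is not the authors' evaluation code. The function names and the example strings are hypothetical, and the choices to split words on whitespace and to count spaces as characters are assumptions; evaluation toolkits differ on text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words.
    Assumes whitespace tokenization (an illustrative choice)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.
    Spaces are counted as characters here (an illustrative choice)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example strings, not taken from the paper's corpus.
print(wer("the cat sat down", "the cat sit down"))  # one substituted word out of four
print(cer("abcd", "abxd"))                          # one substituted character out of four
```

Both metrics can exceed 100% when the hypothesis contains many insertions, since the denominator is the reference length only.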

Related papers

Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval [49.1574468325115]
We introduce Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10. More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller.
arXiv Detail & Related papers (2025-05-25T23:06:20Z)
Whispering in Amharic: Fine-tuning Whisper for Low-resource Language [3.2858851789879595]
This work explores fine-tuning OpenAI's Whisper automatic speech recognition model for Amharic. We fine-tune it using datasets like Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The best-performing model, Whispersmall-am, significantly improves when finetuned on a mix of existing FLEURS data and new, unseen Amharic datasets.
arXiv Detail & Related papers (2025-03-24T09:39:41Z)
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [64.36534512742736]
We investigate the effectiveness of test-time training (TTT) as a mechanism for improving models' reasoning capabilities. TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models.
arXiv Detail & Related papers (2024-11-11T18:59:45Z)
Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy. Our results show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates which are 3.3% and 2.0% lower on these benchmarks.
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
A Novel Self-training Approach for Low-resource Speech Recognition [15.612232220719653]
We propose a self-training approach for automatic speech recognition (ASR) for low-resource settings. Our approach significantly improves word error rate, achieving a relative improvement of 14.94%. Our proposed approach reports the best results on the Common Voice Punjabi dataset.
arXiv Detail & Related papers (2023-08-10T01:02:45Z)
Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST). We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2% to 0.8% of original trainable parameters. Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
arXiv Detail & Related papers (2023-05-18T22:02:59Z)
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English. For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated. We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
arXiv Detail & Related papers (2022-07-14T12:49:15Z)
Discriminative Self-training for Punctuation Prediction [5.398944179152948]
Punctuation prediction for automatic speech recognition (ASR) output transcripts plays a crucial role in improving the readability of the ASR transcripts. Achieving good performance on punctuation prediction often requires large amounts of labeled speech transcripts. We propose a Discriminative Self-Training approach with weighted loss and discriminative label smoothing to exploit unlabeled speech transcripts.
arXiv Detail & Related papers (2021-04-21T03:32:47Z)
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline. We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures. Our best end-to-end model, based on RNN-Transducer together with improved beam search, comes within 3.8% absolute WER of the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.