Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages
- URL: http://arxiv.org/abs/2501.00425v1
- Date: Tue, 31 Dec 2024 13:03:20 GMT
- Title: Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages
- Authors: Or Haim Anidjar, Revital Marbel, Roi Yozevitch
- Abstract summary: This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques.
Our approach outperforms two baselines: the pre-trained Wav2Vec2 and the well-known Whisper ASR model.
- Score: 1.3108652488669736
- Abstract: Approaching Speech-to-Text and Automatic Speech Recognition problems in low-resource languages is notoriously challenging due to the scarcity of validated datasets and the diversity of dialects. Arabic, Russian, and Portuguese exemplify these difficulties: each is effectively low-resource because of the many dialects spoken across different continents, and the variety of accents and pronunciations further complicates ASR. With the growing popularity of Deep Learning and Transformers, acoustic models such as the renowned Wav2Vec2 have outperformed earlier state-of-the-art approaches in speech recognition. However, although Wav2Vec2 requires far less labeled data than traditional methods, its performance declines significantly for under-represented languages. This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. To validate the framework's effectiveness, we conducted a detailed experimental evaluation using three datasets from Mozilla's Common Voice project in Arabic, Russian, and Portuguese. The framework is also robust to different diacritics. Ultimately, our approach outperforms two baselines, the pre-trained Wav2Vec2 and the well-known Whisper ASR model, with an average relative improvement of 33.9% in Word Error Rate and 53.2% in Character Error Rate.
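The abstract credits the gains to data augmentation during Wav2Vec2 fine-tuning but does not enumerate the transforms here; the sketch below shows common waveform-level augmentations (noise injection at a target SNR, random gain, circular time shift) as a hedged illustration. All function names and parameter values are assumptions, not the paper's recipe. Note also that relative improvement means (baseline - ours) / baseline, so a WER drop from 0.50 to 0.33 is a 34% relative improvement.

```python
import random
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wav.shape)
    return wav + noise

def random_gain(wav: np.ndarray, low_db: float = -6.0, high_db: float = 6.0) -> np.ndarray:
    """Scale the waveform by a random gain drawn uniformly in dB."""
    gain_db = np.random.uniform(low_db, high_db)
    return wav * (10 ** (gain_db / 20))

def time_shift(wav: np.ndarray, max_frac: float = 0.1) -> np.ndarray:
    """Circularly shift the waveform by up to max_frac of its length."""
    limit = int(len(wav) * max_frac)
    return np.roll(wav, np.random.randint(-limit, limit + 1))

def augment(wav: np.ndarray) -> np.ndarray:
    """Apply one randomly chosen augmentation per training example."""
    op = random.choice([lambda w: add_noise(w, snr_db=15.0), random_gain, time_shift])
    return op(wav)
```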
Related papers
- Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance [9.624005980086707]
State-of-the-art methods deploy self-supervised transfer learning where a model pre-trained on large amounts of data is fine-tuned using little labeled data.
We show that Frisian ASR performance can be improved by using multilingual fine-tuning data and an auxiliary language identification task.
arXiv Detail & Related papers (2025-02-07T12:42:46Z)
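For intuition on how multilingual fine-tuning with an auxiliary language-identification task is typically wired, a minimal sketch follows: a shared encoder feeds both a CTC transcription head and a pooled LID classifier, and the two losses are combined with a weight. The module, head sizes, and loss weight are assumptions, not this paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASRWithLID(nn.Module):
    """Illustrative multi-task head: CTC logits for transcription plus a
    language-identification classifier over mean-pooled encoder states."""

    def __init__(self, encoder: nn.Module, hidden: int, vocab: int, n_langs: int):
        super().__init__()
        self.encoder = encoder                  # e.g. a pre-trained speech encoder
        self.ctc_head = nn.Linear(hidden, vocab)
        self.lid_head = nn.Linear(hidden, n_langs)

    def forward(self, wav: torch.Tensor):
        states = self.encoder(wav)              # (batch, frames, hidden)
        ctc_logits = self.ctc_head(states)      # per-frame vocabulary scores
        lid_logits = self.lid_head(states.mean(dim=1))  # one prediction per utterance
        return ctc_logits, lid_logits

def joint_loss(ctc_loss: torch.Tensor, lid_logits: torch.Tensor,
               lang_ids: torch.Tensor, weight: float = 0.3) -> torch.Tensor:
    """Total objective: ASR loss plus a weighted auxiliary LID loss."""
    return ctc_loss + weight * F.cross_entropy(lid_logits, lang_ids)
```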
- CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages [1.9263811967110864]
This paper introduces a context-aware model for robust counterspeech generation, which achieved significant success in the MCG-COLING-2025 shared task.
By pairing a simulated annealing search with models fine-tuned on multilingual datasets, the system generates factually accurate responses to hate speech.
We demonstrate state-of-the-art performance across four languages, with our system ranking first for Basque, second for Italian, and third for both English and Spanish.
arXiv Detail & Related papers (2025-01-01T03:36:31Z)
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, i.e., languages not included in its pre-training data.
We propose methods that exploit relationships between seen and unseen languages, modeled through language embeddings, to enhance ASR performance on unseen languages (a sketch of one such embedding combination follows this entry).
arXiv Detail & Related papers (2024-12-21T04:05:43Z)
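One way to read "exploiting relationships" between languages is sketched below: approximate an unseen language's embedding as a similarity-weighted combination of the embeddings of seen languages. The weighting scheme is a generic assumption; the paper's exact method may differ.

```python
import torch

def unseen_language_embedding(seen_embs: torch.Tensor,
                              similarity: torch.Tensor) -> torch.Tensor:
    """Approximate an unseen language's embedding from seen-language embeddings.

    seen_embs:  (n_langs, dim) language embeddings learned in pre-training
    similarity: (n_langs,) scores relating the unseen language to each seen
                one (e.g. typological or phylogenetic similarity)
    """
    weights = torch.softmax(similarity, dim=0)  # normalize scores to a distribution
    return weights @ seen_embs                  # (dim,) weighted combination
```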
- Whisper Finetuning on Nepali Language [0.0]
This research focuses on building an exhaustive and generalized dataset, then fine-tuning OpenAI's Whisper models to improve transcription accuracy for the Nepali language.
We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles, further enriched through augmentation.
Our approach outperforms Whisper's baseline models trained on the FLEURS dataset, achieving WER reductions of up to 36.2% on the small model and 23.8% on the medium model.
arXiv Detail & Related papers (2024-11-19T15:55:56Z)
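For reference, a minimal single training step for Whisper with the Hugging Face transformers library; the checkpoint, toy audio, and label text are placeholders, not this paper's dataset or hyperparameters.

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio = np.zeros(16000, dtype=np.float32)  # stand-in for one second of 16 kHz speech
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("नमस्ते संसार", return_tensors="pt").input_ids

# Teacher-forced cross-entropy over the target transcription tokens.
outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()  # then step an optimizer (e.g. AdamW) as usual
```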
- ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams [16.172599163455693]
First, we leverage high-quality data from linguistically or geographically related languages to improve TTS for the target language.
Second, we utilize low-quality Automatic Speech Recognition (ASR) data recorded in non-studio environments, which is refined using denoising and speech enhancement models.
Third, we apply knowledge distillation from large-scale models using synthetic data to generate more robust outputs.
arXiv Detail & Related papers (2024-10-23T14:18:25Z)
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1% on two benchmarks, with word error rates that are 3.3% and 2.0% lower, respectively.
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
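A small sketch of generic N-best re-ranking: re-score each decoder hypothesis with external signals, here a language model and a language-identification confidence, and keep the argmax. The feature set and weights are illustrative assumptions, not necessarily the paper's scoring function.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    asr_score: float  # log-probability from the ASR decoder
    lm_score: float   # log-probability from an external language model
    lid_score: float  # confidence that the text is in the expected language

def rerank(nbest: list[Hypothesis], alpha: float = 0.5, beta: float = 0.5) -> Hypothesis:
    """Return the hypothesis maximizing a weighted combination of scores."""
    return max(nbest, key=lambda h: h.asr_score + alpha * h.lm_score + beta * h.lid_score)
```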
- Efficient Compression of Multitask Multilingual Speech Models [0.0]
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2 (the distillation term is sketched after this entry).
DistilWhisper bridges the resulting ASR performance gap for under-represented languages while retaining the advantages of multitask and multilingual capabilities.
arXiv Detail & Related papers (2024-05-02T03:11:59Z)
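The distillation component can be illustrated with the standard soft-label objective: a KL divergence between temperature-scaled teacher (here whisper-large-v2) and student distributions. The temperature and scaling below are generic choices, not necessarily DistilWhisper's exact setup.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-label distillation: KL divergence between temperature-softened
    teacher and student output distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```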
- Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech (VGS) models from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition [58.849768879796905]
We propose Adapt-and-Adjust (A2), a transformer-based multi-task learning framework for end-to-end multilingual speech recognition.
The A2 framework overcomes the long-tail problem via three techniques: (1) exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; (2) proposing dual adapters consisting of both language-specific and language-agnostic adaptation with minimal additional parameters; and (3) overcoming the class imbalance, either by imposing class priors in the loss during training or by adjusting the logits of the softmax output during inference (a generic sketch of this logit adjustment follows the entry).
arXiv Detail & Related papers (2020-12-03T03:46:16Z)
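Technique (3) matches the standard logit-adjustment recipe, sketched below in a generic form: add log class priors to the logits inside the training loss (so the model must beat the head-class bias), or subtract them at inference to boost tail classes. A2's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def prior_adjusted_loss(logits, targets, class_priors, tau: float = 1.0):
    """Training-time adjustment: bias logits toward the class priors so that
    reducing the loss requires overcoming the head-class advantage."""
    adjusted = logits + tau * torch.log(class_priors)  # broadcast over the batch
    return F.cross_entropy(adjusted, targets)

def prior_adjusted_logits(logits, class_priors, tau: float = 1.0):
    """Inference-time adjustment: subtract log priors to boost tail classes."""
    return logits - tau * torch.log(class_priors)
```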
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations (sketched after this entry).
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
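The contrastive task can be sketched at a single masked time step: identify the true quantized latent among sampled distractors using temperature-scaled cosine similarities. Shapes, sampling, and temperature below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, positives, negatives, temperature: float = 0.1):
    """wav2vec 2.0-style objective for one masked step.

    context:   (batch, dim)    transformer output at the masked position
    positives: (batch, dim)    true quantized latent for that position
    negatives: (batch, K, dim) K distractors sampled from other positions
    """
    candidates = torch.cat([positives.unsqueeze(1), negatives], dim=1)  # (batch, K+1, dim)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    targets = torch.zeros(context.size(0), dtype=torch.long)  # index 0 is the positive
    return F.cross_entropy(sims, targets)
```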