Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
- URL: http://arxiv.org/abs/2408.05554v1
- Date: Sat, 10 Aug 2024 13:39:13 GMT
- Title: Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
- Authors: Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang
- Abstract summary: It is worth researching how to improve the performance of Whisper on under-represented languages.
We utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
- Score: 22.19230427358921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented end of transcript (EOT) judgment modification and hallucination penalty to improve the performance of speech recognition. Further, we employed the decoding average token log probability as a criterion to select samples from unlabeled speech data and used pseudo-labeled data to fine-tune the model to further improve its performance. Ultimately, we achieved more than 10% absolute WER reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
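The pseudo-labeling step described in the abstract can be sketched as follows: decode unlabeled audio, score each hypothesis by its average token log probability, and keep only the confident pairs for fine-tuning. This is a minimal illustrative sketch, not the authors' implementation; the record fields and the threshold value are assumptions.

```python
def average_token_logprob(token_logprobs):
    """Mean log probability over the decoded tokens of one hypothesis."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def select_pseudo_labels(decoded, threshold=-0.5):
    """Keep (audio, transcript) pairs whose decoding confidence, measured
    as the average token log probability, clears the threshold."""
    selected = []
    for utt in decoded:
        score = average_token_logprob(utt["token_logprobs"])
        if score >= threshold:
            selected.append((utt["audio"], utt["text"]))
    return selected


# Hypothetical decoder output: two confident hypotheses pass the filter,
# one low-confidence hypothesis ("b.wav") is discarded.
decoded = [
    {"audio": "a.wav", "text": "salem", "token_logprobs": [-0.1, -0.2, -0.3]},
    {"audio": "b.wav", "text": "rahmet", "token_logprobs": [-1.5, -2.0]},
    {"audio": "c.wav", "text": "iya", "token_logprobs": [-0.05, -0.1]},
]
print(select_pseudo_labels(decoded))
```

The retained pairs would then serve as pseudo-labeled training data for fine-tuning; the threshold trades off pseudo-label quantity against label noise.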
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z) - Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z) - Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition [4.67385883375784]
This paper addresses the Automatic Speech Recognition (ASR) challenge for the Tunisian dialect.
First, textual and audio data are collected and, in some cases, annotated.
Second, we explore self-supervision, semi-supervision and few-shot code-switching approaches to push the state-of-the-art on different Tunisian test sets.
Third, and given the absence of conventional spelling, we produce a human evaluation of our transcripts to avoid the noise coming from spelling in our testing references.
arXiv Detail & Related papers (2023-09-20T13:56:27Z) - KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods [3.3721926640077804]
We show that self-supervised pre-training, a simple curriculum schedule during fine-tuning, and semi-supervised learning significantly improve speech recognition performance for Kinyarwanda.
Our model achieves 3.2% word error rate (WER) on a new dataset and 15.6% WER on the Mozilla Common Voice benchmark.
arXiv Detail & Related papers (2023-08-23T01:44:28Z) - Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data [2.064612766965483]
Natural language inference (NLI) models that perform well in zero- and few-shot settings can improve hate speech detection performance.
Our evaluation on five languages demonstrates large performance improvements of NLI fine-tuning over direct fine-tuning in the target language.
arXiv Detail & Related papers (2023-06-06T14:40:41Z) - Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text [4.863310073296471]
We propose 2SDiac, a multi-source model that can effectively support optional diacritics in the input to inform all predictions.
We also introduce Guided Learning, a training scheme that leverages diacritics given in the input under different levels of random masking.
arXiv Detail & Related papers (2023-06-06T10:18:17Z) - Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z) - Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are transcribed.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extending to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.