Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization
- URL: http://arxiv.org/abs/2412.19785v1
- Date: Fri, 27 Dec 2024 18:32:24 GMT
- Title: Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization
- Authors: Kumud Tripathi, Raj Gothi, Pankaj Wasnik
- Abstract summary: This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition has recently seen a significant advancement with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed. Our extensive experiments demonstrate that the tokenizer significantly reduces inference time, while prompt-tuning enhances accuracy across various Whisper model sizes, including Small, Medium, and Large. Together, these techniques achieve a balance between optimal WER and inference speed.
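The abstract describes prompt-tuning with language family information only at a high level; below is a minimal, hypothetical PyTorch sketch of soft prompts keyed by language family, prepended to a frozen decoder's input embeddings. The class name, family inventory, and dimensions are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class LanguageFamilyPrompts(nn.Module):
    """Learnable soft prompts, one set per language family (e.g. Indo-Aryan,
    Dravidian), prepended to the input embeddings of a frozen decoder."""

    def __init__(self, d_model: int, num_families: int, prompt_len: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(
            torch.randn(num_families, prompt_len, d_model) * 0.02
        )

    def forward(self, token_embeds: torch.Tensor, family_id: int) -> torch.Tensor:
        # token_embeds: (batch, seq, d_model) decoder input embeddings
        batch = token_embeds.size(0)
        prompt = self.prompts[family_id].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# Only the prompt parameters are trained; the Whisper weights stay frozen.
FAMILIES = {"indo_aryan": 0, "dravidian": 1}
prompts = LanguageFamilyPrompts(d_model=768, num_families=len(FAMILIES))
embeds = torch.randn(2, 10, 768)               # stand-in for real embeddings
out = prompts(embeds, FAMILIES["indo_aryan"])  # shape: (2, 18, 768)
```

The tokenizer contribution is complementary: since decoding is autoregressive, fewer generated tokens per utterance directly translates into fewer decoder forward passes and thus faster inference.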
Related papers
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively. However, Whisper struggles with unseen languages, those not included in its pre-training. We propose methods that exploit the linguistic relationships between seen and unseen languages to enhance ASR performance on unseen languages.
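The summary does not state the mechanism precisely; one hedged reading is that an unseen language's embedding is approximated from the embeddings of related seen languages. The sketch below is an assumption-laden illustration of that idea, not the authors' method:

```python
import torch
import torch.nn.functional as F

def unseen_language_embedding(seen_embeds: torch.Tensor,
                              relatedness: torch.Tensor) -> torch.Tensor:
    """Weighted combination of seen-language embeddings.

    seen_embeds: (num_seen, d) embeddings learned for seen languages.
    relatedness: (num_seen,) scores for how close each seen language is to
                 the unseen one (e.g. from a language-family taxonomy).
    """
    weights = F.softmax(relatedness, dim=0)  # normalise to a distribution
    return (weights.unsqueeze(1) * seen_embeds).sum(dim=0)
```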
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - A two-stage transliteration approach to improve performance of a multilingual ASR [1.9511556030544333]
This paper presents an approach to build a language-agnostic end-to-end model trained on a grapheme set.
We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages.
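As a rough illustration of mapping multiple scripts onto one shared grapheme set (the exact transliteration scheme is not given in the summary), consider a toy character table; a real system would use a full standard such as ISO 15919 per script:

```python
# Toy map from Devanagari characters to a common Latin grapheme set;
# purely illustrative, not the paper's actual transliteration table.
DEVANAGARI_TO_COMMON = {"क": "ka", "ख": "kha", "ग": "ga", "न": "na", "म": "ma"}

def to_common_graphemes(text: str, table: dict[str, str]) -> str:
    """Map each character to the shared grapheme set; pass through unknowns."""
    return "".join(table.get(ch, ch) for ch in text)
```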
arXiv Detail & Related papers (2024-10-09T05:30:33Z) - Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages [51.12146889808824]
Meta-Whisper is a novel approach to improve automatic speech recognition for low-resource languages.
It enhances Whisper's ability to recognize speech in unfamiliar languages without extensive fine-tuning.
arXiv Detail & Related papers (2024-09-16T16:04:16Z) - Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text [22.19230427358921]
Improving Whisper's performance on under-represented languages remains an open research problem.
We utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
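The summary does not specify how GPT and Whisper are combined; a common pattern for exploiting unpaired text with an external LM is n-best rescoring, sketched below under that assumption (the interpolation weight `lam` is illustrative):

```python
def rescore_nbest(hypotheses, lam: float = 0.3) -> str:
    """Pick the hypothesis maximising ASR log-prob + lam * LM log-prob.

    hypotheses: list of (text, asr_logprob, lm_logprob) triples, where the
    LM score would come from a causal LM such as GPT run over the text.
    """
    return max(hypotheses, key=lambda h: h[1] + lam * h[2])[0]

best = rescore_nbest([
    ("hypothesis one", -12.4, -35.1),
    ("hypothesis two", -12.9, -28.7),  # worse ASR score, better LM score
])
```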
arXiv Detail & Related papers (2024-08-10T13:39:13Z) - Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
This paper examines how to expand the speech translation capability of speech foundation models with restricted data.
Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model.
Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space.
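A minimal sketch of speech-to-speech retrieval over mean-pooled encoder states, assuming the embeddings have already been extracted (function names and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def retrieve(query_embed: torch.Tensor, pool: torch.Tensor, k: int = 5):
    """Return indices of the k pool utterances closest to the query.

    query_embed: (d,) mean-pooled encoder states of one utterance.
    pool:        (n, d) mean-pooled encoder states of candidate utterances,
                 possibly in other languages.
    """
    sims = F.cosine_similarity(query_embed.unsqueeze(0), pool, dim=1)
    return sims.topk(k).indices
```

If utterances with the same meaning retrieve each other across languages, that is evidence the encoder maps them into a shared semantic space.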
arXiv Detail & Related papers (2024-07-01T09:51:48Z) - Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation [45.29184681700463]
Speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a strong speech-to-text decoder.
We propose Whisper-Flamingo which integrates visual features into the Whisper speech recognition and translation model with gated cross attention.
Our models achieve state-of-the-art ASR WER (0.68%) and AVSR WER (0.76%) on LRS3, and state-of-the-art ASR WER (1.3%) and AVSR WER (1.4%) on LRS2.
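The gating mechanism follows the Flamingo recipe; below is a generic PyTorch sketch of a gated cross-attention layer, not the authors' exact module:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention: the tanh gate starts at 0, so
    training begins from the unmodified (audio-only) behaviour."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d) speech hidden states; visual: (batch, v_len, d)
        attn_out, _ = self.attn(query=x, key=visual, value=visual)
        return x + torch.tanh(self.gate) * attn_out
```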
arXiv Detail & Related papers (2024-06-14T14:36:54Z) - Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper [51.12146889808824]
This research explores how the information in prompts interacts with Whisper, a high-performing speech recognition model.
Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way.
It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages.
arXiv Detail & Related papers (2024-06-09T14:44:59Z) - Keyword-Guided Adaptation of Automatic Speech Recognition [17.011087631073863]
We propose a novel approach for improved jargon-word recognition through contextual biasing of Whisper-based models.
We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process.
Our results show a significant improvement in the recognition accuracy of specified keywords and a reduction in the overall word error rate.
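One hedged reading of the pipeline: the keyword spotter produces per-keyword confidences from the encoder representation, and detected keywords are surfaced to the decoder as a biasing prompt. A toy sketch (all names and the threshold are assumptions):

```python
def build_keyword_prompt(keyword_scores: dict[str, float],
                         threshold: float = 0.5) -> str:
    """Turn keyword-spotter confidences into a biasing prompt string."""
    detected = [kw for kw, s in keyword_scores.items() if s >= threshold]
    return ", ".join(detected)

prompt = build_keyword_prompt({"myocarditis": 0.91, "aspirin": 0.2})
# e.g. fed to Whisper through its initial-prompt / prefix mechanism
```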
arXiv Detail & Related papers (2024-06-04T14:20:38Z) - Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
With limited training data, finetuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z) - Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
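The distillation objective is not spelled out in the summary; a standard temperature-scaled KL formulation, which such approaches commonly use, looks like this:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as is standard for distillation."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```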
arXiv Detail & Related papers (2023-11-02T08:37:30Z) - MADGF: Multi-Agent Data Generation Framework [0.5700195008916903]
We present a novel Multi-Agent Data Generation Framework (MADGF) to address the scarcity of mixed-language training data.
We finetune the open-source multilingual ASR model, Whisper, utilizing our generated Mixed Cantonese and English (MCE) audio dataset.
arXiv Detail & Related papers (2023-10-27T08:01:55Z) - Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization [61.60501633397704]
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
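As one concrete (assumed) example of manipulating the default special tokens, the Hugging Face API lets you override Whisper's language token while keeping the transcribe task, which can coax translation-like behaviour; whether this matches the authors' exact prompts is not stated in the summary:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Default decoding of English audio would use <|en|><|transcribe|>; forcing a
# different language token is one illustrative special-token manipulation.
forced_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
# output_ids = model.generate(input_features, forced_decoder_ids=forced_ids)
```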
arXiv Detail & Related papers (2023-05-18T16:32:58Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
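A minimal sketch of the concatenation-style augmentation described above, assuming raw waveforms at a shared sampling rate (names are illustrative):

```python
import numpy as np

def concat_augment(utt_a: np.ndarray, label_a: str,
                   utt_b: np.ndarray, label_b: str):
    """Concatenate two utterances from different source languages, along with
    their transcripts, to synthesise a code-switched training example."""
    audio = np.concatenate([utt_a, utt_b])  # assumes equal sampling rates
    label = f"{label_a} {label_b}"
    return audio, label
```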
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high- and low-resource languages.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)