Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization
- URL: http://arxiv.org/abs/2412.19785v1
- Date: Fri, 27 Dec 2024 18:32:24 GMT
- Title: Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization
- Authors: Kumud Tripathi, Raj Gothi, Pankaj Wasnik
- Abstract summary: This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition has recently seen a significant advancement with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed. Our extensive experiments demonstrate that the tokenizer significantly reduces inference time, while prompt-tuning enhances accuracy across various Whisper model sizes, including Small, Medium, and Large. Together, these techniques achieve a balance between optimal WER and inference speed.
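The abstract describes prompt-tuning with language family information only at a high level; below is a minimal, hypothetical PyTorch sketch of soft prompts keyed by language family, prepended to a frozen decoder's input embeddings. The class name, family inventory, and dimensions are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class LanguageFamilyPrompts(nn.Module):
    """Learnable soft prompts, one set per language family (e.g. Indo-Aryan,
    Dravidian), prepended to the input embeddings of a frozen decoder."""

    def __init__(self, d_model: int, num_families: int, prompt_len: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(
            torch.randn(num_families, prompt_len, d_model) * 0.02
        )

    def forward(self, token_embeds: torch.Tensor, family_id: int) -> torch.Tensor:
        # token_embeds: (batch, seq, d_model) decoder input embeddings
        batch = token_embeds.size(0)
        prompt = self.prompts[family_id].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# Only the prompt parameters are trained; the Whisper weights stay frozen.
FAMILIES = {"indo_aryan": 0, "dravidian": 1}
prompts = LanguageFamilyPrompts(d_model=768, num_families=len(FAMILIES))
embeds = torch.randn(2, 10, 768)               # stand-in for real embeddings
out = prompts(embeds, FAMILIES["indo_aryan"])  # shape: (2, 18, 768)
```

The tokenizer contribution is complementary: since decoding is autoregressive, fewer generated tokens per utterance directly translates into fewer decoder forward passes and thus faster inference.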
Related papers
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively. However, Whisper struggles with unseen languages, those not included in its pre-training. We propose methods that exploit the linguistic relationships between seen and unseen languages to enhance ASR performance on unseen languages.
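The summary does not state the mechanism precisely; one hedged reading is that an unseen language's embedding is approximated from the embeddings of related seen languages. The sketch below is an assumption-laden illustration of that idea, not the authors' method:

```python
import torch
import torch.nn.functional as F

def unseen_language_embedding(seen_embeds: torch.Tensor,
                              relatedness: torch.Tensor) -> torch.Tensor:
    """Weighted combination of seen-language embeddings.

    seen_embeds: (num_seen, d) embeddings learned for seen languages.
    relatedness: (num_seen,) scores for how close each seen language is to
                 the unseen one (e.g. from a language-family taxonomy).
    """
    weights = F.softmax(relatedness, dim=0)  # normalise to a distribution
    return (weights.unsqueeze(1) * seen_embeds).sum(dim=0)
```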
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - A two-stage transliteration approach to improve performance of a multilingual ASR [1.9511556030544333]
This paper presents an approach to build a language-agnostic end-to-end model trained on a grapheme set.
We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages.
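As a rough illustration of mapping multiple scripts onto one shared grapheme set (the exact transliteration scheme is not given in the summary), consider a toy character table; a real system would use a full standard such as ISO 15919 per script:

```python
# Toy map from Devanagari characters to a common Latin grapheme set;
# purely illustrative, not the paper's actual transliteration table.
DEVANAGARI_TO_COMMON = {"क": "ka", "ख": "kha", "ग": "ga", "न": "na", "म": "ma"}

def to_common_graphemes(text: str, table: dict[str, str]) -> str:
    """Map each character to the shared grapheme set; pass through unknowns."""
    return "".join(table.get(ch, ch) for ch in text)
```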
arXiv Detail & Related papers (2024-10-09T05:30:33Z) - Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages [51.12146889808824]
Meta-Whisper is a novel approach to improve automatic speech recognition for low-resource languages.
It enhances Whisper's ability to recognize speech in unfamiliar languages without extensive fine-tuning.
arXiv Detail & Related papers (2024-09-16T16:04:16Z) - Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text [22.19230427358921]
Improving Whisper's performance on under-represented languages remains an open research problem.
We utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
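The summary does not specify how GPT and Whisper are combined; a common pattern for exploiting unpaired text with an external LM is n-best rescoring, sketched below under that assumption (the interpolation weight `lam` is illustrative):

```python
def rescore_nbest(hypotheses, lam: float = 0.3) -> str:
    """Pick the hypothesis maximising ASR log-prob + lam * LM log-prob.

    hypotheses: list of (text, asr_logprob, lm_logprob) triples, where the
    LM score would come from a causal LM such as GPT run over the text.
    """
    return max(hypotheses, key=lambda h: h[1] + lam * h[2])[0]

best = rescore_nbest([
    ("hypothesis one", -12.4, -35.1),
    ("hypothesis two", -12.9, -28.7),  # worse ASR score, better LM score
])
```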
arXiv Detail & Related papers (2024-08-10T13:39:13Z) - Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
This paper examines how to expand the speech translation capability of speech foundation models with restricted data.
Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model.
Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space.
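A minimal sketch of speech-to-speech retrieval over mean-pooled encoder states, assuming the embeddings have already been extracted (function names and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def retrieve(query_embed: torch.Tensor, pool: torch.Tensor, k: int = 5):
    """Return indices of the k pool utterances closest to the query.

    query_embed: (d,) mean-pooled encoder states of one utterance.
    pool:        (n, d) mean-pooled encoder states of candidate utterances,
                 possibly in other languages.
    """
    sims = F.cosine_similarity(query_embed.unsqueeze(0), pool, dim=1)
    return sims.topk(k).indices
```

If utterances with the same meaning retrieve each other across languages, that is evidence the encoder maps them into a shared semantic space.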
arXiv Detail & Related papers (2024-07-01T09:51:48Z) - Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation [45.29184681700463]
Speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a strong speech-to-text decoder.
We propose Whisper-Flamingo which integrates visual features into the Whisper speech recognition and translation model with gated cross attention.
Our models achieve state-of-the-art ASR WER (0.68%) and AVSR WER (0.76%) on LRS3, and state-of-the-art ASR WER (1.3%) and AVSR WER (1.4%) on LRS2.
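The gating mechanism follows the Flamingo recipe; below is a generic PyTorch sketch of a gated cross-attention layer, not the authors' exact module:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention: the tanh gate starts at 0, so
    training begins from the unmodified (audio-only) behaviour."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d) speech hidden states; visual: (batch, v_len, d)
        attn_out, _ = self.attn(query=x, key=visual, value=visual)
        return x + torch.tanh(self.gate) * attn_out
```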
arXiv Detail & Related papers (2024-06-14T14:36:54Z) - Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper [51.12146889808824]
This research explores how the information in prompts interacts with Whisper, a high-performing speech recognition model.
Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way.
It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages.
arXiv Detail & Related papers (2024-06-09T14:44:59Z) - Keyword-Guided Adaptation of Automatic Speech Recognition [17.011087631073863]
We propose a novel approach for improved jargon-word recognition through contextual biasing of Whisper-based models.
We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process.
Our results show a significant improvement in the recognition accuracy of specified keywords and a reduction in the overall word error rate.
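One hedged reading of the pipeline: the keyword spotter produces per-keyword confidences from the encoder representation, and detected keywords are surfaced to the decoder as a biasing prompt. A toy sketch (all names and the threshold are assumptions):

```python
def build_keyword_prompt(keyword_scores: dict[str, float],
                         threshold: float = 0.5) -> str:
    """Turn keyword-spotter confidences into a biasing prompt string."""
    detected = [kw for kw, s in keyword_scores.items() if s >= threshold]
    return ", ".join(detected)

prompt = build_keyword_prompt({"myocarditis": 0.91, "aspirin": 0.2})
# e.g. fed to Whisper through its initial-prompt / prefix mechanism
```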
arXiv Detail & Related papers (2024-06-04T14:20:38Z) - Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
With limited training data, finetuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z) - Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
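The distillation objective is not spelled out in the summary; a standard temperature-scaled KL formulation, which such approaches commonly use, looks like this:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as is standard for distillation."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```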
arXiv Detail & Related papers (2023-11-02T08:37:30Z) - MADGF: Multi-Agent Data Generation Framework [0.5700195008916903]
We present a novel Multi-Agent Data Generation Framework (MADGF) to address the scarcity of mixed-language training data.
We finetune the open-source multilingual ASR model, Whisper, utilizing our generated Mixed Cantonese and English (MCE) audio dataset.
arXiv Detail & Related papers (2023-10-27T08:01:55Z) - Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization [61.60501633397704]
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
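As one concrete (assumed) example of manipulating the default special tokens, the Hugging Face API lets you override Whisper's language token while keeping the transcribe task, which can coax translation-like behaviour; whether this matches the authors' exact prompts is not stated in the summary:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Default decoding of English audio would use <|en|><|transcribe|>; forcing a
# different language token is one illustrative special-token manipulation.
forced_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
# output_ids = model.generate(input_features, forced_decoder_ids=forced_ids)
```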
arXiv Detail & Related papers (2023-05-18T16:32:58Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
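A minimal sketch of the concatenation-style augmentation described above, assuming raw waveforms at a shared sampling rate (names are illustrative):

```python
import numpy as np

def concat_augment(utt_a: np.ndarray, label_a: str,
                   utt_b: np.ndarray, label_b: str):
    """Concatenate two utterances from different source languages, along with
    their transcripts, to synthesise a code-switched training example."""
    audio = np.concatenate([utt_a, utt_b])  # assumes equal sampling rates
    label = f"{label_a} {label_b}"
    return audio, label
```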
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high- and low-resource languages.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)