Gated Low-rank Adaptation for personalized Code-Switching Automatic Speech Recognition on the low-spec devices
- URL: http://arxiv.org/abs/2406.02562v1
- Date: Wed, 24 Apr 2024 01:31:39 GMT
- Title: Gated Low-rank Adaptation for personalized Code-Switching Automatic Speech Recognition on the low-spec devices
- Authors: Gwantae Kim, Bokyeung Lee, Donghyeon Kim, Hanseok Ko,
- Abstract summary: We introduce a gated low-rank adaptation(GLoRA) for parameter-efficient fine-tuning with minimal performance degradation.
Our experiments, conducted on Korean-English code-switching datasets, demonstrate that fine-tuning speech recognition models for code-switching surpasses the performance of traditional code-switching speech recognition models trained from scratch.
- Score: 28.06179341376626
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent times, there has been a growing interest in utilizing personalized large models on low-spec devices, such as mobile and CPU-only devices. However, utilizing a personalized large model in the on-device is inefficient, and sometimes limited due to computational cost. To tackle the problem, this paper presents the weights separation method to minimize on-device model weights using parameter-efficient fine-tuning methods. Moreover, some people speak multiple languages in an utterance, as known as code-switching, the personalized ASR model is necessary to address such cases. However, current multilingual speech recognition models are limited to recognizing a single language within each utterance. To tackle this problem, we propose code-switching speech recognition models that incorporate fine-tuned monolingual and multilingual speech recognition models. Additionally, we introduce a gated low-rank adaptation(GLoRA) for parameter-efficient fine-tuning with minimal performance degradation. Our experiments, conducted on Korean-English code-switching datasets, demonstrate that fine-tuning speech recognition models for code-switching surpasses the performance of traditional code-switching speech recognition models trained from scratch. Furthermore, GLoRA enhances parameter-efficient fine-tuning performance compared to conventional LoRA.
Related papers
- Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting [45.161909551392085]
We introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner.
Our method has shown to significantly reduce errors by 28% on average and by 41% on low-resource languages.
arXiv Detail & Related papers (2024-06-18T13:38:58Z) - Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced
Code-Switching Speech Recognition [5.3545957730615905]
We introduce language identification information into the middle layer of the ASR model's encoder.
We aim to generate acoustic features that imply language distinctions in a more implicit way, reducing the model's confusion when dealing with language switching.
arXiv Detail & Related papers (2023-12-15T07:46:35Z) - Multilingual self-supervised speech representations improve the speech
recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data finetuning self-supervised representations is a better performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z) - Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation [13.16188747098854]
We propose a novel attention-based encoder-decoder (HAED) speech recognition model.
Our model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques.
We demonstrate that the proposed HAED model yields 23% relative Word Error Rate (WER) improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2023-09-14T01:07:36Z) - Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and
Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) is referred to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are transcribed.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5,03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z) - Dyn-ASR: Compact, Multilingual Speech Recognition via Spoken Language
and Accent Identification [0.013428344011390777]
We propose a new approach to enable multilingual speech recognition on edge devices.
This approach uses both language identification and accent identification to select one of multiple monolingual ASR models on-the-fly.
Initial results for both recognition performance and resource usage are promising with our approach using less than 1/12th of the memory consumed by other solutions.
arXiv Detail & Related papers (2021-08-04T12:59:53Z) - How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.