Pushing the performances of ASR models on English and Spanish accents
- URL: http://arxiv.org/abs/2212.12048v1
- Date: Thu, 22 Dec 2022 21:48:29 GMT
- Title: Pushing the performances of ASR models on English and Spanish accents
- Authors: Pooja Chitkara, Morgane Riviere, Jade Copet, Frank Zhang, Yatharth
Saraf
- Abstract summary: We show how pre-trained embeddings and auxiliary classification losses can improve the performance of ASR systems.
We are looking for upgrades as universal as possible and therefore we will explore their impact on several models architectures and several languages.
- Score: 13.031634839137773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech to text models tend to be trained and evaluated against a single
target accent. This is especially true for English for which native speakers
from the United States became the main benchmark. In this work, we are going to
show how two simple methods: pre-trained embeddings and auxiliary
classification losses can improve the performance of ASR systems. We are
looking for upgrades as universal as possible and therefore we will explore
their impact on several models architectures and several languages.
Related papers
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z) - Efficient Compression of Multitask Multilingual Speech Models [0.0]
DistilWhisper is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
arXiv Detail & Related papers (2024-05-02T03:11:59Z) - Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech
Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
arXiv Detail & Related papers (2023-11-02T08:37:30Z) - Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z) - Adapting Multi-Lingual ASR Models for Handling Multiple Talkers [63.151811561972515]
State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
arXiv Detail & Related papers (2023-05-30T05:05:52Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for
Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - AccentDB: A Database of Non-Native English Accents to Assist Neural
Speech Recognition [3.028098724882708]
We first spell out the key requirements for creating a well-curated database of speech samples in non-native accents for training and testing robust ASR systems.
We then introduce AccentDB, one such database that contains samples of 4 Indian-English accents collected by us.
We present several accent classification models and evaluate them thoroughly against human-labelled accent classes.
arXiv Detail & Related papers (2020-05-16T12:38:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.