A Study of Multilingual End-to-End Speech Recognition for Kazakh,
Russian, and English
- URL: http://arxiv.org/abs/2108.01280v1
- Date: Tue, 3 Aug 2021 04:04:01 GMT
- Title: A Study of Multilingual End-to-End Speech Recognition for Kazakh,
Russian, and English
- Authors: Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol
- Abstract summary: We study training a single end-to-end (E2E) automatic speech recognition (ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and English.
We first describe the development of multilingual E2E ASR based on Transformer networks and then perform an extensive assessment on the aforementioned languages.
- Score: 5.094176584161206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study training a single end-to-end (E2E) automatic speech recognition
(ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and
English. We first describe the development of multilingual E2E ASR based on
Transformer networks and then perform an extensive assessment on the
aforementioned languages. We also compare two variants of output grapheme set
construction: combined and independent. Furthermore, we evaluate the impact of
LMs and data augmentation techniques on the recognition performance of the
multilingual E2E ASR. In addition, we present several datasets for training and
evaluation purposes. Experiment results show that the multilingual models
achieve comparable performances to the monolingual baselines with a similar
number of parameters. Our best monolingual and multilingual models achieved
20.9% and 20.5% average word error rates on the combined test set,
respectively. To ensure the reproducibility of our experiments and results, we
share our training recipes, datasets, and pre-trained models.
Related papers
- Efficient Multilingual ASR Finetuning via LoRA Language Experts [59.27778147311189]
This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper.<n>Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods.<n> Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance gains in language-aware and language-agnostic scenarios.
arXiv Detail & Related papers (2025-06-11T07:06:27Z) - Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis [4.774607166378613]
Self-supervised pre-training, combined with large language models (LLM), can effectively enhance ASR performance in low-resource scenarios.<n>We pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and do alignment training on a supervised dataset of 40,000 hours.
arXiv Detail & Related papers (2025-05-27T12:50:55Z) - Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? [3.902360015414256]
This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings.
Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages.
arXiv Detail & Related papers (2025-02-10T16:00:00Z) - CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages [1.9263811967110864]
This paper introduces a context-aware model for robust counterspeech generation, which achieved significant success in the MCG-COLING-2025 shared task.
By leveraging a simulated annealing algorithm fine-tuned on multilingual datasets, the model generates factually accurate responses to hate speech.
We demonstrate state-of-the-art performance across four languages, with our system ranking first for Basque, second for Italian, and third for both English and Spanish.
arXiv Detail & Related papers (2025-01-01T03:36:31Z) - Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - STTATTS: Unified Speech-To-Text And Text-To-Speech Model [6.327929516375736]
We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters.
Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models.
arXiv Detail & Related papers (2024-10-24T10:04:24Z) - Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer.
We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio.
We derive a power-law relationship that links performance with dataset size, model size and sampling ratios.
arXiv Detail & Related papers (2024-10-15T20:29:38Z) - A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives [2.3592914313389257]
We are comparing monolingual Wav2Vec 2.0 models with various multilingual models to see whether we could improve speech recognition performance.
Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models.
arXiv Detail & Related papers (2024-07-24T11:03:47Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Cross-Lingual Knowledge Distillation for Answer Sentence Selection in
Low-Resource Languages [90.41827664700847]
We propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages.
To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages.
arXiv Detail & Related papers (2023-05-25T17:56:04Z) - Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speechs (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - Improving Massively Multilingual ASR With Auxiliary CTC Objectives [40.10307386370194]
We introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark.
We investigate techniques inspired from recent Connectionist Temporal Classification ( CTC) studies to help the model handle the large number of languages.
Our state-of-the-art systems using self-supervised models with the Conformer architecture improve over the results of prior work on FLEURS by a relative 28.4% CER.
arXiv Detail & Related papers (2023-02-24T18:59:51Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - Cross-lingual Spoken Language Understanding with Regularized
Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z) - Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters [31.705705891482594]
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages.
We compare three variants of multilingual training from a single joint model without knowing the input language, to using this information, to multiple heads.
We show that multilingual training of ASR models on several languages can improve recognition performance, in particular, on low resource languages.
arXiv Detail & Related papers (2020-07-06T18:43:38Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.