Scaling Up Deliberation for Multilingual ASR
- URL: http://arxiv.org/abs/2210.05785v1
- Date: Tue, 11 Oct 2022 21:07:00 GMT
- Title: Scaling Up Deliberation for Multilingual ASR
- Authors: Ke Hu, Bo Li, Tara N. Sainath
- Abstract summary: We investigate second-pass deliberation for multilingual speech recognition.
Our proposed deliberation is multilingual, i.e., the text encoder encodes hypothesis text from multiple languages, and the decoder attends to multilingual text and audio.
We show that deliberation improves the average WER on 9 languages by 4% relative compared to the single-pass model.
- Score: 36.860327600638705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual end-to-end automatic speech recognition models are attractive
due to their simplicity in training and deployment. Recent work on large-scale
training of such models has shown promising results compared to monolingual
models. However, the work often focuses on multilingual models themselves in a
single-pass setup. In this work, we investigate second-pass deliberation for
multilingual speech recognition. Our proposed deliberation is multilingual,
i.e., the text encoder encodes hypothesis text from multiple languages, and the
decoder attends to multilingual text and audio. We investigate scaling the
deliberation text encoder and decoder, and compare scaling the deliberation
decoder and the first-pass cascaded encoder. We show that deliberation improves
the average WER on 9 languages by 4% relative compared to the single-pass
model. By increasing the size of the deliberation up to 1B parameters, the
average WER improvement increases to 9%, with up to 14% for certain languages.
Our deliberation rescorer is based on transformer layers and can be
parallelized during rescoring.
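To make the two-pass setup concrete, here is a minimal sketch of a deliberation rescorer in this spirit: a text encoder encodes first-pass hypothesis tokens, and a transformer decoder attends to a memory built from the audio encoding and the hypothesis encoding. Module names and dimensions are illustrative assumptions, and fusing the two sources into a single attention memory is a simplification of the paper's attention scheme, not its exact design.

```python
import torch
import torch.nn as nn

class DeliberationRescorer(nn.Module):
    """Sketch of a second-pass deliberation rescorer (illustrative, not the paper's exact model)."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Text encoder over first-pass hypothesis tokens (any language).
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder attends to both audio and encoded hypothesis text.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio_enc, hyp_tokens, target_tokens):
        # audio_enc: (B, T_audio, d_model) from the first-pass (cascaded) encoder;
        # hyp_tokens / target_tokens: (B, T) integer token ids.
        hyp_enc = self.text_encoder(self.embed(hyp_tokens))
        # Simplification: concatenate audio and text encodings into one memory.
        memory = torch.cat([audio_enc, hyp_enc], dim=1)
        tgt = self.embed(target_tokens)
        t = tgt.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        # Transformer layers score all target positions in parallel under
        # teacher forcing, which is what makes the rescorer parallelizable,
        # unlike step-by-step recurrent decoding.
        dec = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(dec)  # (B, T, vocab) logits used to rescore hypotheses
```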
Related papers
- Streaming Bilingual End-to-End ASR model using Attention over Multiple Softmax [6.386371634323785]
We propose a novel bilingual end-to-end (E2E) modeling approach, where a single neural model can recognize both languages.
The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism.
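One plausible reading of "language-specific joint networks combined via self-attention" is mixing per-language output distributions with learned attention weights; the sketch below assumes a shared feature vector per frame, and the layer names, sizes, and mixing scheme are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class BilingualJoint(nn.Module):
    """Sketch: language-specific output heads mixed by learned attention weights."""

    def __init__(self, d_model, vocab_size, num_langs=2):
        super().__init__()
        # One joint network (and softmax) per language over shared features.
        self.joints = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(num_langs)])
        self.attn = nn.Linear(d_model, num_langs)  # per-language mixing scores

    def forward(self, shared_feat):
        # shared_feat: (B, T, d_model) from the shared encoder/prediction nets.
        weights = torch.softmax(self.attn(shared_feat), dim=-1)   # (B, T, L)
        probs = torch.stack(
            [torch.softmax(j(shared_feat), dim=-1) for j in self.joints],
            dim=-1)                                               # (B, T, V, L)
        # Mix the per-language distributions into one output: (B, T, V).
        return (probs * weights.unsqueeze(2)).sum(dim=-1)
```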
arXiv Detail & Related papers (2024-01-22T01:44:42Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In settings with limited training data, finetuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss [11.447307867370064]
We introduce a bilingual solution to support English as a secondary locale for most primary locales in automatic speech recognition (ASR).
Our key developments are: (a) a pronunciation lexicon with grapheme units instead of phone units, and (b) a fully bilingual alignment model and, subsequently, a bilingual streaming transformer model.
We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications.
arXiv Detail & Related papers (2023-08-11T18:06:33Z)
- Improved Cross-Lingual Transfer Learning For Automatic Speech Translation [18.97234151624098]
We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, we achieve significantly better cross-lingual task knowledge transfer.
We demonstrate the effectiveness of our approach on two popular datasets, namely, CoVoST-2 and Europarl.
arXiv Detail & Related papers (2023-06-01T15:19:06Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
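Since both tasks share the transducer structure, here is a minimal sketch of the standard transducer joint network; LAMASSU's language-agnostic components and streaming details are not reproduced, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransducerJoint(nn.Module):
    """Sketch of a standard RNN-T-style joint network."""

    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for blank

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim) acoustic states; pred: (B, U, pred_dim) label states.
        # Broadcast-add over the (T, U) lattice used by the transducer loss.
        joint = torch.tanh(self.enc_proj(enc).unsqueeze(2)
                           + self.pred_proj(pred).unsqueeze(1))
        return self.out(joint)  # (B, T, U, vocab+1) logits
```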
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
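A minimal sketch of this concatenation-style augmentation, assuming utterances are stored as (waveform, transcript, language) triples with at least two languages present; the sampling scheme is an illustrative assumption.

```python
import random

def make_codeswitched(utterances):
    """utterances: list of (waveform: list[float], transcript: str, lang: str).

    Returns one synthetic code-switched training example by splicing
    together two utterances from different source languages.
    """
    a = random.choice(utterances)
    # Assumes at least two languages are present in the pool.
    b = random.choice([u for u in utterances if u[2] != a[2]])
    audio = a[0] + b[0]            # concatenate waveforms
    label = a[1] + " " + b[1]      # concatenate transcripts
    return audio, label
```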
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate acoustic model (AM) and language model (LM).
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a phonotactic model can hurt zero-shot transfer.
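The hybrid decomposition boils down to interpolating an acoustic score with a weighted phonotactic LM score; in this one-line sketch, the weight lam (an illustrative parameter, not the paper's notation) controls how strongly phonotactics is imposed.

```python
# Hybrid hypothesis score: acoustic log-likelihood plus a weighted
# phonotactic LM log-probability. A large lam imposes phonotactics
# strongly, which per the paper's finding can hurt zero-shot transfer.
def hybrid_score(am_logprob: float, lm_logprob: float, lam: float = 0.5) -> float:
    return am_logprob + lam * lm_logprob
```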
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
- One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech [3.42658286826597]
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation.
Our model is shown to effectively share information across languages and, according to a subjective evaluation test, it produces more natural and accurate code-switching speech than the baselines.
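A minimal sketch of contextual parameter generation, where a small meta-network maps a language embedding to the weights of a layer; the dimensions and names are illustrative assumptions, not the paper's TTS architecture.

```python
import torch
import torch.nn as nn

class GeneratedLinear(nn.Module):
    """Sketch: a linear layer whose parameters are generated from a language embedding."""

    def __init__(self, lang_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Parameter generator: language embedding -> weights and bias.
        self.gen = nn.Linear(lang_dim, in_dim * out_dim + out_dim)

    def forward(self, x, lang_emb):
        # x: (B, in_dim); lang_emb: (B, lang_dim)
        params = self.gen(lang_emb)
        w = params[:, :self.in_dim * self.out_dim].view(-1, self.out_dim, self.in_dim)
        b = params[:, self.in_dim * self.out_dim:]
        # Per-example linear transform with language-generated parameters.
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1) + b
```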
arXiv Detail & Related papers (2020-08-03T10:43:30Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
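A minimal sketch of the wav2vec 2.0-style contrastive objective: identify the true quantized latent among distractors by cosine similarity; the temperature kappa and the negative-sampling details are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, true_latent, distractors, kappa=0.1):
    # context, true_latent: (B, D); distractors: (B, K, D), negatives
    # sampled from other masked positions of the same utterance.
    candidates = torch.cat([true_latent.unsqueeze(1), distractors], dim=1)
    # Cosine similarity of the context vector to each candidate: (B, K+1).
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / kappa
    # The true latent sits at index 0 of every candidate set.
    targets = torch.zeros(context.size(0), dtype=torch.long, device=context.device)
    return F.cross_entropy(sims, targets)
```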
arXiv Detail & Related papers (2020-06-24T18:25:05Z)