How Phonotactics Affect Multilingual and Zero-shot ASR Performance
- URL: http://arxiv.org/abs/2010.12104v2
- Date: Wed, 10 Feb 2021 18:53:38 GMT
- Title: How Phonotactics Affect Multilingual and Zero-shot ASR Performance
- Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
- Abstract summary: A Transformer encoder-decoder model has been shown to leverage multilingual data well when trained on IPA transcriptions of the languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
- Score: 74.70048598292583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The idea of combining multiple languages' recordings to train a single
automatic speech recognition (ASR) model brings the promise of the emergence of
universal speech representation. Recently, a Transformer encoder-decoder model
has been shown to leverage multilingual data well in IPA transcriptions of
languages presented during training. However, the representations it learned
were not successful in zero-shot transfer to unseen languages. Because that
model lacks an explicit factorization of the acoustic model (AM) and language
model (LM), it is unclear to what degree the performance suffered from
differences in pronunciation or the mismatch in phonotactics. To gain more
insight into the factors limiting zero-shot ASR transfer, we replace the
encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
Then, we perform an extensive evaluation of monolingual, multilingual, and
crosslingual (zero-shot) acoustic and language models on a set of 13
phonetically diverse languages. We show that the gain from modeling
crosslingual phonotactics is limited, and that imposing too strong a model can hurt
zero-shot transfer. Furthermore, we find that a multilingual LM hurts a
multilingual ASR system's performance, and retaining only the target language's
phonotactic data in LM training is preferable.
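To make the factorization concrete, here is a minimal sketch (not the authors' code; the bigram LM, function names, and weights are assumptions) of how a hybrid system scores a hypothesized IPA phone sequence as a separate acoustic term plus a weighted phonotactic LM term. In the zero-shot condition described above, the LM would be trained only on the target language's phone sequences.

```python
# Minimal sketch of hybrid ASR scoring with an explicit AM/LM factorization:
#   score(phones) = log P(X | phones) + lm_weight * log P(phones)
# Illustrative only; the bigram choice and all names are assumptions.
import math
from collections import defaultdict

def train_bigram_phone_lm(phone_sequences, smoothing=1e-3):
    """Estimate a smoothed bigram phonotactic LM over IPA phone tokens."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        for prev, cur in zip(["<s>"] + seq, seq + ["</s>"]):
            counts[prev][cur] += 1.0
            vocab.add(cur)

    def logprob(prev, cur):
        total = sum(counts[prev].values()) + smoothing * len(vocab)
        return math.log((counts[prev][cur] + smoothing) / total)

    return logprob

def hybrid_score(am_logprobs, phones, lm_logprob, lm_weight=0.5):
    """Combine per-phone acoustic log-likelihoods with a phonotactic LM score."""
    acoustic = sum(am_logprobs)  # one log P(X_t | phone) term per phone
    phonotactic = sum(
        lm_logprob(prev, cur)
        for prev, cur in zip(["<s>"] + phones, phones + ["</s>"])
    )
    return acoustic + lm_weight * phonotactic

# Zero-shot setup: multilingual AM, but the LM sees only target-language data.
lm = train_bigram_phone_lm([["k", "a", "t"], ["t", "a", "k"], ["a", "k", "a"]])
print(hybrid_score([-1.2, -0.7, -2.1], ["k", "a", "t"], lm))
```

Tuning `lm_weight` down is one way to weaken the phonotactic prior, which is relevant given the finding that too strong a model hurts zero-shot transfer.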
Related papers
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
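As a rough illustration of this recipe (a sketch under assumed tooling: HuggingFace Transformers, the XLS-R 300M checkpoint, and placeholder vocabulary size and data, none of which are taken from the paper):

```python
# Hypothetical finetuning sketch: attach a fresh CTC head to a multilingual
# self-supervised checkpoint and train it on target-language transcripts.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",  # multilingual SSL checkpoint (assumed)
    vocab_size=64,                   # placeholder target-language vocab size
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()       # common practice with little labeled data

wave = torch.randn(1, 16000)             # one second of 16 kHz audio (dummy)
labels = torch.randint(1, 64, (1, 12))   # placeholder transcript token ids
loss = model(input_values=wave, labels=labels).loss
loss.backward()
```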
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
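For context, a neural transducer is trained by marginalizing over all frame/token alignments; a minimal version of that criterion with torchaudio's RNN-T loss (all shapes invented for illustration; a real joint network would produce the logits) looks like:

```python
# Sketch of the neural-transducer training criterion via torchaudio.
import torch
import torchaudio.functional as TAF

B, T, U, C = 2, 50, 10, 40                 # batch, frames, target len, vocab
logits = torch.randn(B, T, U + 1, C)       # stand-in for joint-network output
targets = torch.randint(1, C, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = TAF.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
```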
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Learning ASR pathways: A sparse multilingual ASR model [31.147484652643282]
We present ASR pathways, a sparse multilingual ASR model that activates language-specific sub-networks ("pathways").
With the overlapping sub-networks, the shared parameters can also enable knowledge transfer for lower-resource languages via joint multilingual training.
Our proposed ASR pathways outperform both dense models and a language-agnostically pruned model, and provide better performance on low-resource languages.
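A toy sketch of the pathways idea (assumed names throughout; fixed random masks stand in for the masks the actual method learns by pruning): each language applies its own binary mask to a shared weight matrix, so overlapping entries receive gradients from several languages during joint training.

```python
# Illustrative language-specific sub-networks over shared parameters.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, languages, density=0.5):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # One fixed binary mask per language; overlapping entries are shared.
        self.masks = {
            lang: (torch.rand(out_dim, in_dim) < density).float()
            for lang in languages
        }

    def forward(self, x, lang):
        w = self.linear.weight * self.masks[lang]  # language-specific pathway
        return nn.functional.linear(x, w, self.linear.bias)

layer = MaskedLinear(80, 256, languages=["en", "sw", "zu"])
out = layer(torch.randn(4, 80), lang="sw")  # activates the "sw" sub-network
```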
arXiv Detail & Related papers (2022-09-13T05:14:08Z)
- Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition [31.575930914290762]
A novel data-driven approach is proposed to investigate the cross-lingual acoustic-phonetic similarities.
Deep neural networks are trained as mapping networks to transform the distributions from different acoustic models into a directly comparable form.
A relative improvement of 8% over the monolingual counterpart is achieved.
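One plausible reading of such a mapping network (layer sizes, phone-set sizes, and the KL training target are assumptions, not details from the paper): a small DNN maps one acoustic model's frame-level posteriors into another model's output space so the two become directly comparable.

```python
# Speculative sketch: map posteriors of AM A (42 phones, assumed) into the
# output space of AM B (56 phones, assumed) and train with a KL objective.
import torch
import torch.nn as nn

mapping_net = nn.Sequential(
    nn.Linear(42, 128),
    nn.ReLU(),
    nn.Linear(128, 56),
    nn.LogSoftmax(dim=-1),   # KLDivLoss expects log-probabilities as input
)
loss_fn = nn.KLDivLoss(reduction="batchmean")

src = torch.rand(32, 42).softmax(dim=-1)   # frames scored by AM A (dummy)
tgt = torch.rand(32, 56).softmax(dim=-1)   # aligned frames from AM B (dummy)
loss = loss_fn(mapping_net(src), tgt)
loss.backward()
```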
arXiv Detail & Related papers (2022-07-07T15:55:41Z)
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded in a well-trained teacher text model into the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
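A common shape for such teacher-student transfer (a sketch assuming temperature-scaled logit matching; the paper's exact objective may differ) combines a KL term against the teacher's distribution with the usual ASR loss:

```python
# Hedged sketch of distilling a text teacher into a speech student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, asr_loss, alpha=0.5, T=2.0):
    """alpha * KL(teacher || student) + (1 - alpha) * asr_loss."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature-scaling correction
    return alpha * kl + (1.0 - alpha) * asr_loss
```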
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between different languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- Learning Disentangled Semantic Representations for Zero-Shot Cross-Lingual Transfer in Multilingual Machine Reading Comprehension [40.38719019711233]
Multilingual pre-trained models are able to zero-shot transfer knowledge from rich-resource languages to low-resource languages in machine reading comprehension (MRC).
In this paper, we propose a novel multilingual MRC framework equipped with a Siamese Semantic Disentanglement Model (SSDM) to disassociate semantics from syntax in representations learned by multilingual pre-trained models.
arXiv Detail & Related papers (2022-04-03T05:26:42Z)
- Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)