Streaming End-to-End Bilingual ASR Systems with Joint Language
Identification
- URL: http://arxiv.org/abs/2007.03900v1
- Date: Wed, 8 Jul 2020 05:00:25 GMT
- Title: Streaming End-to-End Bilingual ASR Systems with Joint Language
Identification
- Authors: Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak,
Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow,
Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried
Kunzmann
- Abstract summary: We introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification.
The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India.
- Score: 19.09014345299161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual ASR technology simplifies model training and deployment, but its
accuracy is known to depend on the availability of language information at
runtime. Since language identity is seldom known beforehand in real-world
scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in
voice-activated smart assistant systems, language identity is also required for
downstream processing of ASR output. In this paper, we introduce streaming,
end-to-end, bilingual systems that perform both ASR and language identification
(LID) using the recurrent neural network transducer (RNN-T) architecture. On
the input side, embeddings from pretrained acoustic-only LID classifiers are
used to guide RNN-T training and inference, while on the output side, language
targets are jointly modeled with ASR targets. The proposed method is applied to
two language pairs: English-Spanish as spoken in the United States, and
English-Hindi as spoken in India. Experiments show that for English-Spanish,
the bilingual joint ASR-LID architecture matches monolingual ASR and
acoustic-only LID accuracies. For the more challenging (owing to
within-utterance code switching) case of English-Hindi, English ASR and LID
metrics show degradation. Overall, in scenarios where users switch dynamically
between languages, the proposed architecture offers a promising simplification
over running multiple monolingual ASR models and an LID classifier in parallel.
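The two ideas in the abstract (LID embeddings on the input side, language targets modeled jointly with ASR targets on the output side) can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the tag tokens `<en>` and `<es>`, the choice to append the tag at the end of the utterance, the use of one fixed embedding per utterance, and all function names are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical language-tag tokens; the abstract does not specify the tag inventory.
LANG_TAGS = {"en": "<en>", "es": "<es>"}
TAG_TO_LANG = {tag: lang for lang, tag in LANG_TAGS.items()}


def add_language_target(tokens: Sequence[str], lang: str) -> List[str]:
    """Output side: append a language tag so one target sequence carries ASR and LID.

    Placing the tag at the end is an assumption of this sketch.
    """
    return list(tokens) + [LANG_TAGS[lang]]


def augment_frames(
    frames: Sequence[Sequence[float]], lid_embedding: Sequence[float]
) -> List[List[float]]:
    """Input side: concatenate an LID embedding to every acoustic feature frame.

    A single utterance-level embedding is reused for all frames here for
    simplicity; a streaming system would update it as audio arrives.
    """
    return [list(frame) + list(lid_embedding) for frame in frames]


def split_hypothesis(hyp: Sequence[str]) -> Tuple[List[str], Optional[str]]:
    """Separate a decoded token sequence into the transcript and the predicted language."""
    words = [tok for tok in hyp if tok not in TAG_TO_LANG]
    tags = [tok for tok in hyp if tok in TAG_TO_LANG]
    lang = TAG_TO_LANG[tags[-1]] if tags else None
    return words, lang
```

For example, `add_language_target(["play", "music"], "en")` yields a training target ending in `<en>`, and `split_hypothesis` recovers the transcript and the language code from a decoded sequence, which is the information downstream processing would consume.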
Related papers
- CL-MASR: A Continual Learning Benchmark for Multilingual ASR [15.974765568276615]
We propose CL-MASR, a benchmark for studying multilingual automatic speech recognition in a continual learning setting.
CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics.
To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task.
arXiv Detail & Related papers (2023-10-25T18:55:40Z)
- Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers.
arXiv Detail & Related papers (2023-06-14T21:24:11Z)
- Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Learning ASR pathways: A sparse multilingual ASR model [31.147484652643282]
We present ASR pathways, a sparse multilingual ASR model that activates language-specific sub-networks ("pathways").
With the overlapping sub-networks, the shared parameters can also enable knowledge transfer for lower-resource languages via joint multilingual training.
Our proposed ASR pathways outperform both dense models and a language-agnostically pruned model, and provide better performance on low-resource languages.
arXiv Detail & Related papers (2022-09-13T05:14:08Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions [0.0]
We show that a convolutional neural network with a Self-Attentive Pooling layer gives promising results on the language identification task in a low-resource setting.
We also substantiate the hypothesis that whenever the dataset is diverse enough that other classification factors, such as gender and age, are well averaged, the confusion matrix of the LID system reflects a language similarity measure.
arXiv Detail & Related papers (2021-05-31T18:35:27Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
- Streaming Language Identification using Combination of Acoustic Representations and ASR Hypotheses [13.976935216584298]
A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel.
We propose an approach that learns and combines acoustic level representations with embeddings estimated on ASR hypotheses.
To reduce the processing cost and latency, we exploit a streaming architecture to identify the spoken language early.
arXiv Detail & Related papers (2020-06-01T04:08:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.