Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization
- URL: http://arxiv.org/abs/2111.15016v1
- Date: Mon, 29 Nov 2021 23:14:54 GMT
- Title: Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization
- Authors: Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu
- Abstract summary: We propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition.
We demonstrate the efficacy of our proposed model on bilingual Mandarin-English speech recognition across both monolingual and code-switched corpora.
- Score: 75.98664099579392
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational bilingual speech encompasses three types of utterances: two
purely monolingual types and one intra-sententially code-switched type. In this
work, we propose a general framework to jointly model the likelihoods of the
monolingual and code-switch sub-tasks that comprise bilingual speech
recognition. By defining the monolingual sub-tasks with label-to-frame
synchronization, our joint modeling framework can be conditionally factorized
such that the final bilingual output, which may or may not be code-switched, is
obtained given only monolingual information. We show that this conditionally
factorized joint framework can be modeled by an end-to-end differentiable
neural network. We demonstrate the efficacy of our proposed model on bilingual
Mandarin-English speech recognition across both monolingual and code-switched
corpora.
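
A rough sketch of this factorization, with notation assumed here rather than taken verbatim from the paper (x is the input speech, y^{zh} and y^{en} are the label-to-frame synchronized monolingual label sequences, and y^{bi} is the final bilingual output):

```latex
P\big(y^{\mathrm{bi}} \mid x\big) \;\approx\;
P\big(y^{\mathrm{bi}} \mid y^{\mathrm{zh}}, y^{\mathrm{en}}\big)\,
P\big(y^{\mathrm{zh}} \mid x\big)\,
P\big(y^{\mathrm{en}} \mid x\big)
```

Because the bilingual output is conditioned only on monolingual quantities, each factor can be trained as a monolingual sub-task while the whole product is realized by a single end-to-end differentiable network.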
Related papers
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that converts only timbre, preserving the original content and source-language prosody, and requires no multilingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
arXiv Detail & Related papers (2024-08-08T18:12:51Z)
- Streaming Bilingual End-to-End ASR model using Attention over Multiple Softmax [6.386371634323785]
We propose a novel bilingual end-to-end (E2E) modeling approach, where a single neural model can recognize both languages.
The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism.
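
A minimal PyTorch sketch of the idea as summarized above; the module names, dimensions, and the exact attention mechanism are assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class MultiSoftmaxJoint(nn.Module):
    """Language-specific joint networks combined by attention.

    Hypothetical reading of the summary: a shared encoder and prediction
    network feed one joint network per language, and a small attention
    layer weights the per-language softmax outputs.
    """

    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size, n_langs=2):
        super().__init__()
        self.joints = nn.ModuleList(
            nn.Sequential(
                nn.Linear(enc_dim + pred_dim, joint_dim),
                nn.Tanh(),
                nn.Linear(joint_dim, vocab_size),
            )
            for _ in range(n_langs)
        )
        # Scores one scalar mixing weight per language from the fused features.
        self.attn = nn.Linear(enc_dim + pred_dim, n_langs)

    def forward(self, enc, pred):
        # enc: (B, enc_dim), pred: (B, pred_dim) at one (t, u) grid point.
        fused = torch.cat([enc, pred], dim=-1)
        per_lang = torch.stack([j(fused) for j in self.joints], dim=1)   # (B, L, V)
        weights = torch.softmax(self.attn(fused), dim=-1).unsqueeze(-1)  # (B, L, 1)
        probs = (weights * torch.softmax(per_lang, dim=-1)).sum(dim=1)   # (B, V)
        return probs.clamp_min(1e-8).log()  # log-probs for a transducer loss
```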
arXiv Detail & Related papers (2024-01-22T01:44:42Z)
- Adapting the adapters for code-switching in multilingual ASR [10.316724084739892]
Large pre-trained multilingual speech models have shown potential in scaling Automatic Speech Recognition to many low-resource languages.
Some of these models employ language adapters, which help to improve monolingual performance.
However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed within the same utterance.
We propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network.
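
A sketch of combining two language adapters at one adaptation point, assuming standard bottleneck adapters and a learned per-frame gate; this is illustrative, not the paper's exact fine-tuning recipe:

```python
import torch
import torch.nn as nn

class FusedLanguageAdapters(nn.Module):
    """Mixes the outputs of two language adapters for code-switched input."""

    def __init__(self, dim, bottleneck=64):
        super().__init__()

        def make_adapter():
            # Standard bottleneck adapter: down-project, nonlinearity, up-project.
            return nn.Sequential(
                nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim)
            )

        self.adapter_a = make_adapter()  # e.g. the Mandarin adapter
        self.adapter_b = make_adapter()  # e.g. the English adapter
        self.gate = nn.Linear(dim, 1)    # learned per-frame mixing weight

    def forward(self, h):
        # h: (B, T, D) hidden states at the adaptation point.
        g = torch.sigmoid(self.gate(h))
        mixed = g * self.adapter_a(h) + (1 - g) * self.adapter_b(h)
        return h + mixed  # residual, as in standard adapter layers
```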
arXiv Detail & Related papers (2023-10-11T12:15:24Z)
- Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers.
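
A sketch of the concatenated-tokenizer idea as read from the summary above; the wrapped tokenizer objects and their vocab_size/encode/decode APIs are hypothetical stand-ins:

```python
class ConcatenatedTokenizer:
    """Reuses two monolingual tokenizers by offsetting the second vocabulary,
    so every emitted token id also identifies its language."""

    def __init__(self, tok_en, tok_zh):
        self.tok_en, self.tok_zh = tok_en, tok_zh
        self.offset = tok_en.vocab_size  # zh ids live above the en range

    def encode(self, text, lang):
        if lang == "en":
            return self.tok_en.encode(text)
        return [i + self.offset for i in self.tok_zh.encode(text)]

    def decode_with_lang(self, ids):
        # Returns (token_string, language_id) pairs: the language ID comes
        # for free from the id range, with no separate LID model needed.
        pairs = []
        for i in ids:
            if i < self.offset:
                pairs.append((self.tok_en.decode([i]), "en"))
            else:
                pairs.append((self.tok_zh.decode([i - self.offset]), "zh"))
        return pairs
```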
arXiv Detail & Related papers (2023-06-14T21:24:11Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Towards Zero-Shot Code-Switched Speech Recognition [44.76492452463019]
We seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting.
We propose to simplify each monolingual module by allowing it to transcribe all speech segments indiscriminately in a single monolingual script.
We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on Mandarin-English SEAME test sets.
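
A minimal sketch of the transliteration-based decoding flow as stated above; asr_en, asr_zh, and selector are hypothetical callables, not the paper's API:

```python
def zero_shot_cs_decode(audio, asr_en, asr_zh, selector):
    """Each monolingual recognizer transcribes the entire utterance in its
    own script (segments from the other language come out transliterated);
    a downstream step then picks which script to keep per segment."""
    hyp_en = asr_en(audio)  # whole utterance, English script only
    hyp_zh = asr_zh(audio)  # whole utterance, Chinese script only
    return selector(hyp_en, hyp_zh)
```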
arXiv Detail & Related papers (2022-11-02T19:52:54Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level.
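
A sketch of the disentangling idea, under the assumption of a shared trunk, per-language branches, and a frame-level language classifier; not the exact LAE architecture from the paper:

```python
import torch.nn as nn

class LanguageAwareEncoder(nn.Module):
    """Separates language-specific information into per-language branches."""

    def __init__(self, feat_dim=80, dim=256, n_langs=2):
        super().__init__()
        self.shared = nn.LSTM(feat_dim, dim, batch_first=True)
        self.branches = nn.ModuleList(
            nn.LSTM(dim, dim, batch_first=True) for _ in range(n_langs)
        )
        self.lang_head = nn.Linear(dim, n_langs)  # frame-level language ID

    def forward(self, feats):
        h, _ = self.shared(feats)                            # (B, T, dim)
        lang_logits = self.lang_head(h)                      # (B, T, n_langs)
        per_lang = [branch(h)[0] for branch in self.branches]
        return per_lang, lang_logits
```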
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
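
A generic teacher-student objective, shown as a common distillation recipe assumed here; the paper's exact loss may differ:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """KL against the teacher's temperature-softened distribution plus the
    usual cross-entropy on the intent labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```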
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
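
An illustrative simplification of such a regularizer (my own reduction of the idea, not the paper's formulation): pull the two languages' word-level and sentence-level representations toward each other with a distance penalty.

```python
import torch.nn.functional as F

def alignment_regularizer(word_reps_a, word_reps_b, sent_rep_a, sent_rep_b):
    """word_reps_*: (B, T, D) token representations; sent_rep_*: (B, D)."""
    word_term = F.mse_loss(word_reps_a.mean(dim=1), word_reps_b.mean(dim=1))
    sent_term = 1 - F.cosine_similarity(sent_rep_a, sent_rep_b, dim=-1).mean()
    return word_term + sent_term
```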
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension of Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)