Monolingual Recognizers Fusion for Code-switching Speech Recognition
- URL: http://arxiv.org/abs/2211.01046v1
- Date: Wed, 2 Nov 2022 11:24:26 GMT
- Title: Monolingual Recognizers Fusion for Code-switching Speech Recognition
- Authors: Tongtong Song, Qiang Xu, Haoyu Lu, Longbiao Wang, Hao Shi, Yuqin Lin,
Yanbing Yang, Jianwu Dang
- Abstract summary: We propose a monolingual recognizers fusion method for CS ASR.
It has two stages: the speech awareness stage and the language fusion stage.
Experiments on a Mandarin-English corpus demonstrate the effectiveness of the proposed method.
- Score: 43.38810173824711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The bi-encoder structure has been intensively investigated in code-switching
(CS) automatic speech recognition (ASR). However, most existing methods require
that the two monolingual ASR models (MAMs) share the same structure and use only
the encoders of the MAMs. As a result, pre-trained MAMs cannot be promptly and
fully exploited for CS ASR. In this paper, we propose a monolingual
recognizers fusion method for CS ASR. It has two stages: the speech awareness
(SA) stage and the language fusion (LF) stage. In the SA stage, acoustic
features are mapped to two language-specific predictions by two independent
MAMs. To keep the MAMs focused on their own language, we further extend the
language-aware training strategy for the MAMs. In the LF stage, the BELM fuses
two language-specific predictions to get the final prediction. Moreover, we
propose a text simulation strategy to simplify the training process of the BELM
and reduce reliance on CS data. Experiments on a Mandarin-English corpus
demonstrate the effectiveness of the proposed method. The mix error rate on the
test set is significantly reduced when open-source pre-trained MAMs are used.
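To make the two-stage pipeline concrete, below is a minimal structural sketch in Python. It is not the authors' implementation: `mam_zh`, `mam_en`, and `belm` are hypothetical stand-ins for the pre-trained monolingual recognizers and the fusion language model, and the text simulation strategy used to train the BELM is omitted.

```python
# Minimal structural sketch of the two-stage fusion (not the authors' code).
# `mam_zh`, `mam_en`, and `belm` are hypothetical stand-ins for pre-trained
# models; each MAM maps acoustic features to a language-specific hypothesis.
from typing import Callable, List

def cs_asr_fuse(features,
                mam_zh: Callable[[object], List[str]],
                mam_en: Callable[[object], List[str]],
                belm: Callable[[List[str], List[str]], List[str]]) -> List[str]:
    # Speech awareness (SA) stage: two independent monolingual recognizers
    # each produce a language-specific token prediction for the same audio.
    hyp_zh = mam_zh(features)   # e.g. ["你", "好", "<unk>", ...]
    hyp_en = mam_en(features)   # e.g. ["<unk>", "<unk>", "world", ...]

    # Language fusion (LF) stage: the BELM consumes both language-specific
    # predictions and emits the final mixed-language transcription.
    return belm(hyp_zh, hyp_en)
```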
Related papers
- Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM [1.3089936156875277]
We introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector.
We propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of LLMs to the speech recognition task.
We also present a connector with an MoE architecture that manages multiple languages efficiently.
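A hedged sketch of what an MoE-based connector between a speech encoder and an LLM could look like; the dimensions, routing, and expert layout below are illustrative assumptions, not the paper's exact design.

```python
# Illustrative MoE connector: per-frame soft routing over expert projections
# that map speech-encoder features into the LLM embedding space.
import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    def __init__(self, d_speech=512, d_llm=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_speech, n_experts)       # per-frame gating
        self.experts = nn.ModuleList(
            [nn.Linear(d_speech, d_llm) for _ in range(n_experts)]
        )

    def forward(self, x):                                  # x: (B, T, d_speech)
        gates = torch.softmax(self.router(x), dim=-1)      # (B, T, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B,T,d_llm,E)
        return (outs * gates.unsqueeze(2)).sum(-1)         # (B, T, d_llm)
```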
arXiv Detail & Related papers (2024-09-24T09:20:22Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome the limitation that GER relies on text alone by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
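As a rough illustration of uncertainty-aware late fusion, the sketch below weights the acoustic (ASR) distribution more heavily when the LLM's token distribution has high entropy; the specific gating function is an assumption, not the exact UADF rule.

```python
# Illustrative uncertainty-aware late fusion at one decoding step: the weight
# on the ASR distribution grows with the LLM's token entropy.
import numpy as np

def fuse_step(p_llm: np.ndarray, p_asr: np.ndarray) -> np.ndarray:
    entropy = -np.sum(p_llm * np.log(p_llm + 1e-10))        # LLM uncertainty
    max_entropy = np.log(len(p_llm))
    lam = entropy / max_entropy                             # in [0, 1]
    fused = (1.0 - lam) * p_llm + lam * p_asr               # lean on ASR when
    return fused / fused.sum()                              # the LLM is unsure
```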
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers.
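A toy illustration of the concatenated-tokenizer idea: two monolingual tokenizers are reused unchanged, the second vocabulary is shifted by an offset, and the language of every emitted token id is recoverable from its range. `tok_en` and `tok_es` are hypothetical stand-ins, not the paper's classes.

```python
# Toy concatenated tokenizer: reuse two monolingual tokenizers, offset the
# second vocabulary, and read language ID directly off the token id range.
class ConcatTokenizer:
    def __init__(self, tok_en, tok_es):
        self.tok_en, self.tok_es = tok_en, tok_es
        self.offset = tok_en.vocab_size          # ids >= offset are Spanish

    def encode(self, text: str, lang: str):
        if lang == "en":
            return self.tok_en.encode(text)
        return [i + self.offset for i in self.tok_es.encode(text)]

    def lang_of(self, token_id: int) -> str:
        return "en" if token_id < self.offset else "es"
```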
arXiv Detail & Related papers (2023-06-14T21:24:11Z)
- Adapting Multi-Lingual ASR Models for Handling Multiple Talkers [63.151811561972515]
State-of-the-art large-scale universal speech models (USMs) show decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
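A small sketch of how a serialized output training (SOT) target with utterance timestamps might be constructed; the token names (`<sc>`, `<t...>`) are illustrative, not the paper's exact inventory.

```python
# Build an SOT-style target: utterances from overlapping talkers are sorted by
# start time and joined with a speaker-change token; timestamp tokens bracket
# each utterance (the enhanced variant's timestamp prediction, illustratively).
def sot_target(utts):
    # utts: list of (start_sec, end_sec, text), one entry per talker utterance
    parts = []
    for start, end, text in sorted(utts, key=lambda u: u[0]):
        parts.append(f"<t{start:.1f}> {text} <t{end:.1f}>")
    return " <sc> ".join(parts)   # <sc> marks a speaker change

print(sot_target([(0.0, 2.1, "hello there"), (1.3, 3.0, "hi how are you")]))
# -> "<t0.0> hello there <t2.1> <sc> <t1.3> hi how are you <t3.0>"
```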
arXiv Detail & Related papers (2023-05-30T05:05:52Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
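One plausible reading of a supervision-guided codebook, sketched below: cluster hidden features from a supervised ASR model with k-means and use the cluster ids as masked-prediction targets. The feature source is an assumption about the approach, not a detail confirmed by the summary.

```python
# Sketch: derive a codebook from supervised ASR hidden features via k-means,
# then assign each frame a discrete target for masked-prediction pre-training.
# `feature_batches` are assumed to be (N_frames, D) hidden-layer features.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(feature_batches, n_codes=500):
    feats = np.concatenate(feature_batches, axis=0)   # (N_frames, D)
    return KMeans(n_clusters=n_codes, n_init=10).fit(feats)

def frame_targets(codebook, features):
    return codebook.predict(features)                 # one code id per frame
```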
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling (xSL) tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel sequences.
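The sketch below shows an InfoNCE-style stand-in for a contrastive consistency term between representations of parallel sentences; CACR's exact formulation may differ.

```python
# Contrastive consistency between paired (parallel) sentence representations:
# each source embedding should match its own target embedding, not the others.
import torch
import torch.nn.functional as F

def consistency_loss(z_src, z_tgt, temperature=0.07):
    # z_src, z_tgt: (B, D) representations of aligned parallel sentences
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(z_src.size(0))       # i-th source matches i-th target
    return F.cross_entropy(logits, labels)
```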
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating spoken language understanding (SLU) models: environmental noise-robustness and E2E semantics evaluation.
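Structurally, the framework reads as a two-module pipeline; the sketch below shows that shape with placeholder `asr` and `bert_classifier` objects (hypothetical names, not the paper's API).

```python
# Structural sketch of the pipeline: a pretrained E2E ASR model produces
# (possibly noisy) transcripts, and a self-supervised LM such as BERT maps
# them to semantic labels such as intents.
def predict_intent(audio, asr, bert_classifier):
    transcript = asr.transcribe(audio)        # E2E ASR front end
    return bert_classifier(transcript)        # semantics from the LM back end
```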
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation [27.19320167337675]
We propose a technique to learn a robust speech encoder in a self-supervised fashion only on the speech side.
This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution to improving E2E-ST, but can also perform pre-training on any acoustic signals.
In the setting without any transcriptions, our technique achieves an average improvement of +1.1 BLEU, and of +2.3 BLEU with MAM pre-training.
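A minimal sketch of masked acoustic modeling: random spans of acoustic frames are zeroed out and the encoder is trained to reconstruct them. Span length, masking rate, and the L1 loss below are illustrative choices, not the paper's exact settings.

```python
# Mask random spans of acoustic frames and reconstruct them: the loss is
# computed only on the masked positions, as in masked-prediction objectives.
import torch

def mam_loss(feats, encoder, mask_prob=0.15, span=4):
    # feats: (B, T, D) acoustic features (e.g. filterbanks)
    B, T, D = feats.shape
    mask = torch.zeros(B, T, dtype=torch.bool)
    for b in range(B):
        for t in range(0, T, span):
            if torch.rand(()) < mask_prob:
                mask[b, t:t + span] = True
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = encoder(corrupted)                       # (B, T, D) reconstruction
    return (recon[mask] - feats[mask]).abs().mean()  # L1 on masked frames only
```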
arXiv Detail & Related papers (2020-10-22T05:02:06Z)
- Streaming End-to-End Bilingual ASR Systems with Joint Language Identification [19.09014345299161]
We introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification.
The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India.
arXiv Detail & Related papers (2020-07-08T05:00:25Z)