Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores
- URL: http://arxiv.org/abs/2406.03814v2
- Date: Fri, 14 Jun 2024 02:36:39 GMT
- Title: Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores
- Authors: Jiaming Zhou, Shiwan Zhao, Hui Wang, Tian-Hao Zhang, Haoqin Sun, Xuechen Wang, Yong Qin
- Abstract summary: We propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism.
Our method selects the appropriate datastore for decoding each frame, ensuring the injection of language-specific information into the ASR process.
- Score: 14.150602045545108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The kNN-CTC model has proven to be effective for monolingual automatic speech recognition (ASR). However, its direct application to multilingual scenarios such as code-switching presents challenges. Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. Our method selects the appropriate datastore for decoding each frame, ensuring the injection of language-specific information into the ASR process. We apply this framework to cutting-edge CTC-based models, developing an advanced CS-ASR system. Extensive experiments demonstrate the remarkable effectiveness of our gated datastore mechanism in enhancing the performance of zero-shot Chinese-English CS-ASR.
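To make the per-frame decoding idea concrete, below is a minimal, illustrative sketch of gated selection between two monolingual datastores combined with the usual kNN-CTC interpolation. It assumes a simple distance-based gate (pick whichever datastore holds the closer nearest neighbour) and a fixed interpolation weight; the paper's actual gating mechanism, distance metric, and hyperparameters may differ, and all names and values below are hypothetical.

```python
import numpy as np

def knn_probs(query, keys, labels, vocab_size, k=16, temperature=10.0):
    """Turn the k nearest datastore entries into a label distribution.
    keys: (N, D) frame-level features; labels: (N,) CTC pseudo-labels."""
    dists = np.linalg.norm(keys - query, axis=1)      # L2 distance to every stored key
    nn = np.argsort(dists)[:k]                        # indices of the k nearest keys
    weights = np.exp(-dists[nn] / temperature)        # closer neighbours weigh more
    probs = np.zeros(vocab_size)
    for idx, w in zip(nn, weights):
        probs[labels[idx]] += w
    return probs / probs.sum(), dists[nn[0]]          # kNN distribution + top-1 distance

def gated_knn_ctc_frame(query, p_ctc, zh_store, en_store, vocab_size, lam=0.4):
    """Fuse one frame's CTC posterior with kNN evidence from whichever
    monolingual datastore a simple distance-based gate selects."""
    p_zh, d_zh = knn_probs(query, *zh_store, vocab_size)
    p_en, d_en = knn_probs(query, *en_store, vocab_size)
    p_knn = p_zh if d_zh <= d_en else p_en            # gate: closer datastore wins
    return lam * p_knn + (1.0 - lam) * p_ctc          # standard kNN interpolation

# Toy usage with random features; in practice the keys and queries would be
# frame-level encoder representations and the labels CTC pseudo-labels.
rng = np.random.default_rng(0)
V, D = 8, 16
zh_store = (rng.normal(size=(200, D)), rng.integers(0, V, size=200))
en_store = (rng.normal(size=(200, D)), rng.integers(0, V, size=200))
fused = gated_knn_ctc_frame(rng.normal(size=D), np.full(V, 1.0 / V),
                            zh_store, en_store, V)
```

In the full framework the fused frame-level posteriors would feed standard CTC decoding (e.g., greedy or prefix beam search); this snippet only illustrates the per-frame fusion step.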
Related papers
- Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting [45.161909551392085]
We introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner.
Our method has been shown to reduce errors by 28% on average and by 41% on low-resource languages.
arXiv Detail & Related papers (2024-06-18T13:38:58Z) - Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z) - Low-resource speech recognition and dialect identification of Irish in a multi-task framework [7.981589711420179]
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (Inter CTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID).
Results are compared to the current best-performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN).
arXiv Detail & Related papers (2024-05-02T13:54:39Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The CSE network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers.
arXiv Detail & Related papers (2023-06-14T21:24:11Z) - CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers.
Compared with existing datasets, CSCD-NS is ten times larger in scale and exhibits a distinct error distribution.
We also propose a novel method that simulates the typing process through an input method (IME) to generate pseudo data.
arXiv Detail & Related papers (2022-11-16T09:25:42Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Optimizing Bilingual Neural Transducer with Synthetic Code-switching Text Generation [10.650573361117669]
Semi-supervised training and synthetic code-switched data can improve the bilingual ASR system on code-switching speech.
Our final system achieves 25% mixed error rate (MER) on the ASCEND English/Mandarin code-switching test set.
arXiv Detail & Related papers (2022-10-21T19:42:41Z) - Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach.
We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z) - Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the code-switching problem.
We use language identities to bias the model to predict the code-switching (CS) points.
This encourages the model to learn language identity information directly from the transcription, so no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)