Non-autoregressive Mandarin-English Code-switching Speech Recognition
with Pinyin Mask-CTC and Word Embedding Regularization
- URL: http://arxiv.org/abs/2104.02258v1
- Date: Tue, 6 Apr 2021 03:01:09 GMT
- Title: Non-autoregressive Mandarin-English Code-switching Speech Recognition
with Pinyin Mask-CTC and Word Embedding Regularization
- Authors: Shun-Po Chuang, Heng-Jui Chang, Sung-Feng Huang, Hung-yi Lee
- Abstract summary: Mandarin-English code-switching (CS) is frequently used among East and Southeast Asian people.
Recent successful non-autoregressive (NAR) ASR models remove the need for left-to-right beam decoding in autoregressive (AR) models.
We propose changing the Mandarin output target of the encoder to Pinyin for faster encoder training, and introduce a Pinyin-to-Mandarin decoder to learn contextualized information.
- Score: 61.749126838659315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mandarin-English code-switching (CS) is frequently used among East and
Southeast Asian speakers. However, intra-sentence switching between two very
different languages makes recognizing CS speech challenging. Meanwhile, recent
non-autoregressive (NAR) ASR models remove the need for the left-to-right beam
decoding used in autoregressive (AR) models while achieving outstanding
performance and fast inference speed. In this paper, we therefore take advantage
of the Mask-CTC NAR ASR framework to tackle CS speech recognition. We propose
changing the Mandarin output target of the encoder to Pinyin for faster encoder
training, and introduce a Pinyin-to-Mandarin decoder to learn contextualized
information. Moreover, we propose word embedding label smoothing to regularize
the decoder with contextualized information, and projection matrix
regularization to bridge the gap between the encoder and decoder. We evaluate
the proposed methods on the SEAME corpus and achieve promising results.
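To make the Mask-CTC framework concrete, here is a minimal sketch of its refinement loop: take the greedy CTC hypothesis, mask the low-confidence tokens, and let a conditional masked-prediction decoder fill them in over a few passes. This illustrates the general idea rather than the paper's implementation; `mlm_decoder`, `mask_id`, the confidence threshold, and the fill schedule are all assumptions, and CTC blank/repeat collapsing is omitted for brevity.

```python
import torch

def mask_ctc_decode(ctc_probs, mlm_decoder, mask_id, threshold=0.9, iterations=2):
    """Sketch of Mask-CTC inference (assumed interfaces, not the paper's code).
    ctc_probs: (T, V) per-frame token probabilities from the CTC head.
    mlm_decoder: callable mapping a (T,) token sequence to (T, V) logits.
    """
    # Greedy CTC hypothesis and its per-token confidences
    # (blank/repeat collapsing is omitted here for brevity).
    conf, tokens = ctc_probs.max(dim=-1)
    tokens = tokens.clone()
    tokens[conf < threshold] = mask_id          # mask uncertain positions
    for _ in range(iterations):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = mlm_decoder(tokens)            # predict all positions at once
        pred_conf, pred = logits.softmax(-1).max(-1)
        # fill the most confident half of the masked slots each pass
        k = max(1, int(masked.sum().item()) // 2)
        scores = torch.where(masked, pred_conf, torch.full_like(pred_conf, -1.0))
        fill = scores.topk(k).indices
        tokens[fill] = pred[fill]
    return tokens
```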
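The Pinyin encoder target can be pictured as converting the Mandarin characters of a code-switched transcript to tone-numbered syllables while leaving English words untouched. A rough sketch using the `pypinyin` package; the paper's exact target units (e.g., tone handling, syllable segmentation) are not specified here, so this mapping is an assumption.

```python
from pypinyin import lazy_pinyin, Style

def to_pinyin_targets(transcript):
    """Map Mandarin tokens to tone-numbered Pinyin, pass English through.
    A sketch for a space-separated code-switched transcript."""
    units = []
    for token in transcript.split():
        # treat a token containing CJK ideographs as Mandarin
        if any('\u4e00' <= ch <= '\u9fff' for ch in token):
            units.extend(lazy_pinyin(token, style=Style.TONE3))
        else:
            units.append(token)
    return units

print(to_pinyin_targets("我 想 喝 coffee"))  # ['wo3', 'xiang3', 'he1', 'coffee']
```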
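Word embedding label smoothing can be understood as replacing the uniform smoothing distribution of ordinary label smoothing with one derived from pretrained-embedding similarity to the gold token, so tokens that are semantically close to the reference keep some probability mass. A hypothetical sketch; `alpha`, `tau`, and the cosine-similarity/softmax construction are assumptions about the general technique, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def word_embedding_label_smoothing(logits, targets, embeddings, alpha=0.1, tau=1.0):
    """Cross-entropy against a soft target that mixes the one-hot label with
    an embedding-similarity distribution (assumed formulation).
    logits: (B, V) decoder outputs; targets: (B,) gold token ids;
    embeddings: (V, D) pretrained word embedding table."""
    with torch.no_grad():
        gold = F.normalize(embeddings[targets], dim=-1)   # (B, D)
        vocab = F.normalize(embeddings, dim=-1)           # (V, D)
        sim = gold @ vocab.t() / tau                      # (B, V) cosine / tau
        soft = F.softmax(sim, dim=-1)
        one_hot = F.one_hot(targets, embeddings.size(0)).float()
        target_dist = (1 - alpha) * one_hot + alpha * soft
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_dist * log_probs).sum(dim=-1).mean()
```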
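Projection matrix regularization is described only at a high level in the abstract; one plausible reading is an L2 penalty pulling the encoder-side and decoder-side output projections toward each other, so that both map hidden states into a shared token space. The sketch below is a guess at that general shape, not the paper's formulation; the choice of weights and `beta` are assumptions, and the two matrices must have matching shapes.

```python
import torch

def projection_regularization(enc_proj_weight, dec_proj_weight, beta=1.0):
    """Assumed L2 penalty between the encoder's and decoder's output
    projection matrices (both of shape (V, D))."""
    return beta * (enc_proj_weight - dec_proj_weight).pow(2).mean()

# Hypothetical usage: added to the main training objective, e.g.
# loss = ctc_loss + mlm_loss + projection_regularization(
#     model.encoder.ctc_head.weight, model.decoder.out_proj.weight)
```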
Related papers
- Using Large Language Model for End-to-End Chinese ASR and NER [35.876792804001646]
We present an encoder-decoder architecture that incorporates speech features through cross-attention.
We compare these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks.
Our experiments reveal that the encoder-decoder architecture outperforms the decoder-only architecture when the context is short.
arXiv Detail & Related papers (2024-01-21T03:15:05Z)
- Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced Code-Switching Speech Recognition [5.3545957730615905]
We introduce language identification information into the middle layer of the ASR model's encoder.
We aim to generate acoustic features that imply language distinctions in a more implicit way, reducing the model's confusion when dealing with language switching.
arXiv Detail & Related papers (2023-12-15T07:46:35Z)
- Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition [29.1423215212174]
The recently emerged joint CTC-Attention model shows significant improvements in automatic speech recognition (ASR).
We propose a linguistic-enhanced transformer, which introduces refined CTC information to the decoder during training.
Experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is relatively reduced by up to 7%.
arXiv Detail & Related papers (2022-10-25T08:12:59Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both monolingual and multilingual speech by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between different languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- Transformer-Transducers for Code-Switched Speech Recognition [23.281314397784346]
We present an end-to-end ASR system using a transformer-transducer model architecture for code-switched speech recognition.
First, we introduce two auxiliary loss functions to handle the low-resource scenario of code-switching.
Second, we propose a novel mask-based training strategy with language ID information to improve label encoder training for intra-sentential code-switching.
arXiv Detail & Related papers (2020-11-30T17:27:41Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.