Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech
Recognition with Pinyin and Character
- URL: http://arxiv.org/abs/2201.10792v1
- Date: Wed, 26 Jan 2022 07:59:03 GMT
- Title: Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech
Recognition with Pinyin and Character
- Authors: Zhao Yang, Wei Xi, Rui Wang, Rui Jiang and Jizhong Zhao
- Abstract summary: Pinyin and characters, as the spelling and writing systems of Mandarin Chinese respectively, reinforce each other.
We propose a novel Mandarin Chinese ASR model with a dual-decoder Transformer designed around the characteristics of pinyin transcripts and character transcripts.
On the AISHELL-1 test set, the proposed Speech-Pinyin-Character-Interaction (SPCI) model achieves a 9.85% character error rate (CER) without a language model.
- Score: 15.999657143705045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end automatic speech recognition (ASR) has achieved promising results.
However, most existing end-to-end ASR methods neglect the use of specific
language characteristics. For Mandarin Chinese ASR, pinyin and characters, as
the language's spelling and writing systems respectively, reinforce each other.
Based on this intuition, we examine related model types that are relevant to,
but not designed for, joint pinyin-character ASR, and propose a novel Mandarin
Chinese ASR model with a dual-decoder Transformer tailored to the
characteristics of pinyin transcripts and character transcripts. Specifically,
a joint pinyin-character layer-wise linear interactive (LWLI) module and a
phonetic posteriorgrams adapter (PPGA) are proposed to achieve inter-layer,
multi-level interaction by adaptively fusing pinyin and character information.
Furthermore, a two-stage training strategy is proposed to make training more
stable and convergence faster. On the test set of the AISHELL-1 dataset, the
proposed Speech-Pinyin-Character-Interaction (SPCI) model achieves a 9.85%
character error rate (CER) without a language model, a 17.71% relative
reduction compared to Transformer-based baseline models.
Related papers
- Large Language Model Should Understand Pinyin for Chinese ASR Error Correction [31.13523648668466]
We propose Pinyin-enhanced GEC to improve Chinese ASR error correction.
Our approach only utilizes synthetic errors for training and employs the one-best hypothesis during inference.
Experiments on the Aishell-1 and the Common Voice datasets demonstrate that our approach consistently outperforms GEC with text-only input.
arXiv Detail & Related papers (2024-09-20T06:50:56Z)
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images with Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
- Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate a language ID for each emitted text token while reusing existing monolingual tokenizers (see the sketch after this list).
arXiv Detail & Related papers (2023-06-14T21:24:11Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, explicitly via Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
Experimenting with a series of strong pretrained language models as well as robust training methods, we find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- Cross-Modal Mutual Learning for Cued Speech Recognition [10.225972737967249]
We propose a Transformer-based cross-modal mutual learning framework to prompt multi-modal interaction.
Our model forces modality-specific information of different modalities to pass through a modality-invariant codebook.
We establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese.
arXiv Detail & Related papers (2022-12-02T10:45:33Z)
- Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition [9.930655347717932]
In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation.
We present a novel method built on multi-level modeling units, which integrates multi-level information for Mandarin speech recognition.
arXiv Detail & Related papers (2022-05-24T11:43:54Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate acoustic model (AM) and language model (LM).
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
- Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the problem.
We use the language identities to bias the model to predict the CS points.
This promotes the model to learn the language identity information directly from transcription, and no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
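The concatenated-tokenizer entry above describes a mechanism concrete enough to sketch. Below is a small, hypothetical Python illustration of the general idea, reusing two monolingual tokenizers by offsetting one vocabulary's IDs so each emitted token ID also identifies its language; the class name, signatures, and toy tokenizers are invented for this example and are not from that paper.

```python
# Hypothetical illustration of a "concatenated tokenizer": two monolingual
# tokenizers are reused by giving their vocabularies disjoint ID ranges, so
# every emitted token ID also identifies its language.
from typing import Callable, List


class ConcatenatedTokenizer:
    def __init__(self,
                 encode_a: Callable[[str], List[int]], vocab_a: int, lang_a: str,
                 encode_b: Callable[[str], List[int]], vocab_b: int, lang_b: str):
        self.encode_a, self.vocab_a, self.lang_a = encode_a, vocab_a, lang_a
        self.encode_b, self.vocab_b, self.lang_b = encode_b, vocab_b, lang_b

    def encode(self, text: str, lang: str) -> List[int]:
        # Tokens from language B are shifted past language A's vocabulary.
        if lang == self.lang_a:
            return self.encode_a(text)
        return [tok + self.vocab_a for tok in self.encode_b(text)]

    def language_of(self, token_id: int) -> str:
        # The ID range alone tells which language a decoded token belongs to.
        return self.lang_a if token_id < self.vocab_a else self.lang_b


# Toy usage with stand-in character-level "tokenizers" (a real system would
# plug in existing Mandarin and English tokenizers here).
def zh_encode(s: str) -> List[int]:
    return [ord(c) % 100 for c in s]

def en_encode(s: str) -> List[int]:
    return [ord(c) % 50 for c in s]

tok = ConcatenatedTokenizer(zh_encode, 100, "zh", en_encode, 50, "en")
ids = tok.encode("hi", "en")
print(ids, [tok.language_of(i) for i in ids])  # shifted IDs, all tagged "en"
```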