Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition
- URL: http://arxiv.org/abs/2207.14578v1
- Date: Fri, 29 Jul 2022 09:49:10 GMT
- Title: Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition
- Authors: Peng Shen, Xugang Lu, Hisashi Kawai
- Abstract summary: We propose a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems.
The proposed encoding combines a pronunciation-based syllable with a character index (CI).
- Score: 38.60303603000269
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks,
pronunciation-based modeling units improve the sharing of modeling units in
model training compared to character-based units, but suffer from homophone
problems. In this study, we propose a novel pronunciation-aware unique
character encoding for building E2E RNN-T-based Mandarin ASR systems. The
proposed encoding combines a pronunciation-based syllable with a character
index (CI). By introducing the CI, the RNN-T model can overcome the homophone
problem while still exploiting pronunciation information when extracting
modeling units. With the proposed encoding, the model outputs can be converted
into the final recognition result through a one-to-one mapping. We conducted
experiments on the Aishell and MagicData datasets, and the results showed the
effectiveness of the proposed method.
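To make the encoding concrete, here is a minimal Python sketch of the idea (illustrative only: the toy lexicon, the `syllable_index` unit format, and the CI assignment order are assumptions, not the paper's exact scheme):

```python
# Minimal sketch of a pronunciation-aware unique character encoding
# (illustrative: toy lexicon, unit format, and CI order are assumptions).

# Toy character-to-pinyin lookup; a real system would use a full lexicon.
CHAR_TO_PINYIN = {"中": "zhong1", "钟": "zhong1", "国": "guo2", "果": "guo3"}

# Group homophones, then give each character a character index (CI)
# within its syllable group.
syllable_members = {}
for char, syl in CHAR_TO_PINYIN.items():
    syllable_members.setdefault(syl, []).append(char)

ENCODE = {char: f"{syl}_{i}"
          for syl, chars in syllable_members.items()
          for i, char in enumerate(chars)}
DECODE = {unit: char for char, unit in ENCODE.items()}  # one-to-one inverse

def encode(text):
    """Map characters to pronunciation-aware units: '中国' -> ['zhong1_0', 'guo2_0']."""
    return [ENCODE[c] for c in text]

def decode(units):
    """Recover the character sequence; the CI removes homophone ambiguity."""
    return "".join(DECODE[u] for u in units)

assert decode(encode("中国")) == "中国"
print(encode("钟"))  # ['zhong1_1'] -- same syllable as '中', distinct CI
```

Because each syllable+CI unit names exactly one character, converting model outputs to text is a plain lookup rather than a homophone disambiguation step.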
Related papers
- Syllable based DNN-HMM Cantonese Speech to Text System [3.976127530758402]
This paper builds a Cantonese Speech-to-Text (STT) system with a syllable-based acoustic model.
The onset-nucleus-coda (ONC) based syllable acoustic modeling achieves the best performance, with a word error rate (WER) of 9.66% and a real-time factor (RTF) of 1.38812; a sketch of ONC decomposition follows below.
arXiv Detail & Related papers (2024-02-13T20:54:24Z)
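Below is a hedged Python sketch of what onset-nucleus-coda decomposition can look like for jyutping syllables; the inventories and the regex are simplified assumptions (syllabic nasals such as standalone "ng" or "m" are not handled), not the paper's exact rules:

```python
import re

# Hedged sketch of onset-nucleus-coda (ONC) decomposition for Cantonese
# jyutping syllables (illustrative; inventories and rules are assumptions).
ONSETS = ["gw", "kw", "ng", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "w", "z", "c", "s", "j"]  # multi-letter first
NUCLEI = ["aa", "oe", "eo", "yu", "a", "e", "i", "o", "u"]
CODAS = ["ng", "p", "t", "k", "m", "n", "i", "u"]

SYLLABLE = re.compile(
    r"^(?P<onset>" + "|".join(ONSETS) + r")?"
    r"(?P<nucleus>" + "|".join(NUCLEI) + r")"
    r"(?P<coda>" + "|".join(CODAS) + r")?"
    r"(?P<tone>[1-6])$"
)

def onc(syllable):
    """Split a jyutping syllable such as 'gwong2' into ONC units plus tone."""
    m = SYLLABLE.match(syllable)
    if m is None:
        raise ValueError(f"not a well-formed jyutping syllable: {syllable}")
    return m.groupdict()

print(onc("gwong2"))  # {'onset': 'gw', 'nucleus': 'o', 'coda': 'ng', 'tone': '2'}
```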
- SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
arXiv Detail & Related papers (2023-10-14T19:51:17Z)
- Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate a language ID for each emitted text token while reusing existing monolingual tokenizers; a sketch of the idea follows below.
arXiv Detail & Related papers (2023-06-14T21:24:11Z)
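A hedged sketch of the concatenated-tokenizer idea: two monolingual tokenizers are reused unchanged and the second vocabulary is shifted by an offset, so every emitted token ID also identifies its language (the `ToyTokenizer` class and the `en`/`es` setup are illustrative stand-ins, not the paper's implementation):

```python
# Hedged sketch of a concatenated tokenizer (illustrative stand-in).
class ToyTokenizer:
    """Stand-in for a monolingual subword tokenizer; word-level for brevity."""
    def __init__(self, vocab):
        self.id_of = {w: i for i, w in enumerate(vocab)}
        self.word_of = dict(enumerate(vocab))

    def encode(self, text):
        return [self.id_of[w] for w in text.split()]

class ConcatenatedTokenizer:
    def __init__(self, tok_en, tok_es):
        self.toks = {"en": tok_en, "es": tok_es}
        self.offset = len(tok_en.id_of)  # Spanish IDs start after English IDs

    def encode(self, text, lang):
        shift = 0 if lang == "en" else self.offset
        return [i + shift for i in self.toks[lang].encode(text)]

    def lang_of(self, token_id):
        # The language ID falls out of the ID range: no separate LID model.
        return "en" if token_id < self.offset else "es"

    def decode_token(self, token_id):
        lang = self.lang_of(token_id)
        local = token_id if lang == "en" else token_id - self.offset
        return self.toks[lang].word_of[local], lang

en = ToyTokenizer(["hello", "world"])
es = ToyTokenizer(["hola", "mundo"])
ct = ConcatenatedTokenizer(en, es)
ids = ct.encode("hello", "en") + ct.encode("mundo", "es")
print([ct.decode_token(i) for i in ids])  # [('hello', 'en'), ('mundo', 'es')]
```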
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated; a sketch of this augmentation follows below.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training, by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
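A minimal sketch of the concatenation-style augmentation, assuming waveforms as NumPy arrays; the pause insertion and the dummy data are illustrative choices, not the paper's exact recipe:

```python
import numpy as np

# Hedged sketch of concatenation-style code-switching augmentation
# (illustrative; pause handling and sampling strategy are assumptions).
def concat_augment(sample_a, sample_b, sr=16000, pause_s=0.1):
    """Join two monolingual (waveform, transcript) pairs into one
    synthetic code-switching training example."""
    audio_a, text_a = sample_a
    audio_b, text_b = sample_b
    pause = np.zeros(int(sr * pause_s), dtype=audio_a.dtype)
    audio = np.concatenate([audio_a, pause, audio_b])
    text = f"{text_a} {text_b}"  # labels are concatenated as well
    return audio, text

# Usage with dummy monolingual utterances:
en = (np.random.randn(16000).astype(np.float32), "good morning")
de = (np.random.randn(16000).astype(np.float32), "guten Morgen")
audio, text = concat_augment(en, de)
print(audio.shape, "->", text)  # (33600,) -> good morning guten Morgen
```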
- Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition [9.930655347717932]
In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation.
We present a novel method involving multi-level modeling units, which integrates multi-level information for Mandarin speech recognition.
arXiv Detail & Related papers (2022-05-24T11:43:54Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training; a sketch of pseudo-language extraction follows below.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
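A hedged sketch of inducing a pseudo language from unlabeled speech: frame features are clustered into discrete units, consecutive repeats are collapsed, and the resulting unit string serves as a pseudo transcript (the k-means-over-random-features setup is an illustrative stand-in for Wav2Seq's actual pipeline, which additionally compresses units with subword modeling):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hedged sketch of pseudo-language induction (illustrative; Wav2Seq's
# actual features, clustering, and unit merging differ in detail).
def pseudo_transcript(frame_features, kmeans):
    units = kmeans.predict(frame_features)       # one discrete unit per frame
    deduped = [u for i, u in enumerate(units)    # collapse consecutive repeats
               if i == 0 or u != units[i - 1]]
    return " ".join(f"u{u}" for u in deduped)    # compact pseudo "words"

# Usage with dummy features (stand-ins for MFCCs or SSL representations):
feats = np.random.randn(200, 39)                 # 200 frames, 39-dim features
km = KMeans(n_clusters=25, n_init=10).fit(feats)
print(pseudo_transcript(feats[:10], km))         # e.g. 'u3 u17 u3 u21 ...'
```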
- Speech recognition for air traffic control via feature learning and end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates a feature learning block, a recurrent neural network (RNN), and the connectionist temporal classification (CTC) loss; a sketch of this architecture follows below.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
arXiv Detail & Related papers (2021-11-04T06:38:21Z)
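A hedged PyTorch sketch of the described pipeline: a convolutional feature learning block over raw waveforms, a recurrent network, and a CTC output layer (layer sizes, strides, and vocabulary size are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Hedged sketch of the described architecture (illustrative hyperparameters).
class RawWaveformCTC(nn.Module):
    def __init__(self, n_tokens=30, hidden=256):
        super().__init__()
        self.features = nn.Sequential(              # learn features from raw audio
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tokens + 1)  # +1 for the CTC blank

    def forward(self, wave):                        # wave: (batch, samples)
        x = self.features(wave.unsqueeze(1))        # -> (batch, 128, frames)
        x, _ = self.rnn(x.transpose(1, 2))          # -> (batch, frames, 2*hidden)
        return self.out(x).log_softmax(-1)          # CTC log-probabilities

model = RawWaveformCTC()
logp = model(torch.randn(2, 16000))                 # two 1-second utterances
print(logp.shape)                                   # (2, frames, n_tokens + 1)
```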
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the code-switching problem.
We use language identities to bias the model toward predicting the CS points; a sketch of such transcript tagging follows below.
This encourages the model to learn language identity information directly from the transcription, so no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
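A minimal sketch of biasing targets with language identities: tags are inserted at each code-switching point so the model can learn language identity directly from the transcription (the tag set, the script-based language test, and the word-level placement are illustrative assumptions, not the paper's exact scheme):

```python
# Hedged sketch of adding language-bias tags to transcripts (illustrative;
# the paper's tag set and placement rules are assumptions here).
def tag_language_switches(tokens):
    """Insert <en>/<zh> tags whenever the language of the token changes."""
    def lang(tok):
        return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in tok) else "en"

    tagged, prev = [], None
    for tok in tokens:
        cur = lang(tok)
        if cur != prev:                  # a code-switching point
            tagged.append(f"<{cur}>")
            prev = cur
        tagged.append(tok)
    return tagged

print(tag_language_switches(["我", "想", "买", "iPhone", "充电器"]))
# ['<zh>', '我', '想', '买', '<en>', 'iPhone', '<zh>', '充电器']
```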