Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin
Speech Recognition with a Syllable-to-Character Converter
- URL: http://arxiv.org/abs/2011.08469v1
- Date: Tue, 17 Nov 2020 06:42:47 GMT
- Title: Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin
Speech Recognition with a Syllable-to-Character Converter
- Authors: Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie
- Abstract summary: This paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T.
By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets.
- Score: 10.262490936452688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end models are favored in automatic speech recognition (ASR) because
of their simplified system structure and superior performance. Among these
models, the recurrent neural network transducer (RNN-T) has achieved significant
progress in streaming on-device speech recognition because of its high accuracy
and low latency. RNN-T adopts a prediction network to enhance language
information, but its language modeling ability is limited because training
still requires paired speech-text data. Further strengthening the language
modeling ability through extra text data, such as shallow fusion with an
external language model, brings only a small performance gain. In view of the
fact that Mandarin Chinese is a character-based language and each character is
pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T
approach to improve the language modeling ability of RNN-T. Our approach
first uses an RNN-T to transform acoustic features into a syllable sequence, and
then converts the syllable sequence into a character sequence through an
RNN-T-based syllable-to-character converter. A rich text repository can thus be
easily used to strengthen the language modeling ability. By introducing several
important tricks, the cascade RNN-T approach surpasses the character-based
RNN-T by a large margin on several Mandarin test sets, with much higher
recognition quality and similar latency.
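
Because every Mandarin character is pronounced as a tonal syllable, syllable-to-character training pairs for the second-stage converter can be generated from text alone, which is what lets a text-only repository strengthen the language modeling ability. Below is a minimal sketch of that data-generation step; the pypinyin package and the Style.TONE3 syllable encoding are illustrative assumptions, since the paper does not name its grapheme-to-syllable toolchain.

```python
# Sketch: derive (tonal-syllable, character) training pairs from raw text
# for the syllable-to-character converter. pypinyin is an assumption here;
# the paper does not specify which grapheme-to-syllable tool it used.
from pypinyin import lazy_pinyin, Style

def text_to_pair(sentence: str):
    """Map a Mandarin sentence to (tonal syllable sequence, character sequence)."""
    # Style.TONE3 appends the tone as a trailing digit, e.g. "yu3 yin1".
    syllables = lazy_pinyin(sentence, style=Style.TONE3)
    characters = list(sentence)
    return syllables, characters

syllables, characters = text_to_pair("语音识别")
print(syllables)   # ['yu3', 'yin1', 'shi2', 'bie2']
print(characters)  # ['语', '音', '识', '别']
```

Since no audio is involved, any large text corpus can be turned into converter training data this way, while only the first-stage syllable RNN-T still needs paired speech data.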
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and
Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Speech recognition for air traffic control via feature learning and
end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates a feature-learning block, a recurrent neural network (RNN), and a connectionist temporal classification (CTC) loss; a minimal sketch of this structure appears after the list below.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
arXiv Detail & Related papers (2021-11-04T06:38:21Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting audio data from that domain.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed timestamping method achieves an average word-timing difference of less than 50 ms.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z) - Investigation of learning abilities on linguistic features in
sequence-to-sequence text-to-speech synthesis [48.151894340550385]
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English.
arXiv Detail & Related papers (2020-05-20T23:26:14Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z) - Rnn-transducer with language bias for end-to-end Mandarin-English
code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the code-switching problem.
We use language identities to bias the model to predict the code-switching (CS) points.
This encourages the model to learn the language identity information directly from the transcription, so no additional language identification (LID) model is needed; a sketch of this label augmentation appears after the list.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
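
For the air-traffic-control system above, the described structure (a feature-learning block over raw waveforms, an RNN, and a CTC output layer) can be sketched in a few lines of PyTorch. All layer sizes below are illustrative assumptions, not the authors' configuration.

```python
# Sketch of a raw-waveform CTC model: convolutional feature learning,
# a recurrent encoder, and a CTC output layer. Sizes are assumptions.
import torch
import torch.nn as nn

class RawWaveformCTC(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        # Feature-learning block: 1-D convolutions over the raw waveform
        # (~25 ms window and 10 ms hop at a 16 kHz sampling rate).
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        feats = self.features(wav.unsqueeze(1))       # (batch, channels, frames)
        encoded, _ = self.rnn(feats.transpose(1, 2))  # (batch, frames, 2*hidden)
        return self.out(encoded).log_softmax(-1)      # log-probs for nn.CTCLoss

model = RawWaveformCTC(vocab_size=30)
log_probs = model(torch.randn(2, 16000))              # two 1-second utterances
# nn.CTCLoss expects (frames, batch, classes): pass log_probs.transpose(0, 1)
```

Learning features directly from the waveform is what allows the whole pipeline to be optimized end-to-end under the CTC objective, rather than starting from hand-crafted filterbank features.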
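
For the code-switching model above, the language-bias idea amounts to interleaving language-identity tokens with the transcription so the transducer learns to predict switch points itself. Below is a minimal sketch of that label augmentation; the <zh>/<en> tag names and the per-token language test are illustrative assumptions.

```python
# Sketch: augment a Mandarin-English code-switching transcript with
# language-identity tokens so an RNN-T can predict switch points directly.
# The <zh>/<en> tag names are illustrative assumptions.
def add_language_tags(tokens):
    tagged, prev = [], None
    for tok in tokens:
        # Crude language test: any CJK character marks the token as Mandarin.
        lang = "<zh>" if any("\u4e00" <= ch <= "\u9fff" for ch in tok) else "<en>"
        if lang != prev:            # emit a tag only at switch points
            tagged.append(lang)
            prev = lang
        tagged.append(tok)
    return tagged

print(add_language_tags(["我", "想", "听", "jazz", "音", "乐"]))
# ['<zh>', '我', '想', '听', '<en>', 'jazz', '<zh>', '音', '乐']
```

Training on tags placed only at switch points is what lets the model absorb language-identity information from the transcription alone, with no separate LID model at inference time.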