Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition
- URL: http://arxiv.org/abs/2004.13522v1
- Date: Sun, 26 Apr 2020 05:12:52 GMT
- Title: Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition
- Authors: Li Fu, Xiaoxiao Li, Libo Zi
- Abstract summary: We propose a novel transformer transducer that combines a self-attention transformer and an RNN.
Experiments are conducted on about 12,000 hours of Mandarin speech sampled at 8 kHz and 16 kHz.
It yields an average of 14.4% and 44.1% relative Word Error Rate (WER) reduction compared with the models using syllable initial/final with tone and Chinese character, respectively.
- Score: 13.04590477394637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling unit and model architecture are two key factors of Recurrent Neural
Network Transducer (RNN-T) in end-to-end speech recognition. To improve the
performance of RNN-T on the Mandarin speech recognition task, a novel transformer
transducer with the combination architecture of self-attention transformer and
RNN is proposed. The choice of modeling units for the transformer transducer is
then explored. In addition, we present a new mix-bandwidth
training method to obtain a general model that is able to accurately recognize
Mandarin speech with different sampling rates simultaneously. All of our
experiments are conducted on about 12,000 hours of Mandarin speech sampled at
8 kHz and 16 kHz. Experimental results show that the Mandarin
transformer transducer using syllable with tone achieves the best performance.
It yields an average of 14.4% and 44.1% relative Word Error Rate (WER)
reduction when compared with the models using syllable initial/final with tone
and Chinese character, respectively. It also outperforms the model based on
syllable initial/final with tone by an average of 13.5% relative Character
Error Rate (CER) reduction.
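The abstract describes the proposed architecture only at a high level. One common way to pair a self-attention transformer with an RNN in a transducer is to use the transformer as the audio encoder and an RNN as the prediction network over previously emitted units; the PyTorch sketch below follows that reading with placeholder layer sizes and vocabulary, and is not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TransformerTransducerSketch(nn.Module):
    """Minimal transducer sketch: a self-attention transformer audio encoder,
    an LSTM prediction network over previously emitted units, and an additive
    joint network. All sizes and the vocabulary size are placeholders."""

    def __init__(self, feat_dim=80, vocab_size=1300, d_model=256, joint_dim=320):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)    # transformer side
        self.embed = nn.Embedding(vocab_size + 1, d_model)               # +1 for blank/start
        self.predictor = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)  # RNN side
        self.enc_proj = nn.Linear(d_model, joint_dim)
        self.pred_proj = nn.Linear(d_model, joint_dim)
        self.joint_out = nn.Linear(joint_dim, vocab_size + 1)            # logits incl. blank

    def forward(self, feats, labels):
        # feats: (B, T, feat_dim) acoustic frames; labels: (B, U) previous output units
        enc = self.encoder(self.input_proj(feats))               # (B, T, d_model)
        pred, _ = self.predictor(self.embed(labels))             # (B, U, d_model)
        # Combine every (frame, unit) pair to form the transducer output lattice.
        joint = torch.tanh(self.enc_proj(enc).unsqueeze(2)       # (B, T, 1, joint_dim)
                           + self.pred_proj(pred).unsqueeze(1))  # (B, 1, U, joint_dim)
        return self.joint_out(joint)                             # (B, T, U, vocab_size + 1)

if __name__ == "__main__":
    model = TransformerTransducerSketch()
    feats = torch.randn(2, 100, 80)            # 2 utterances, 100 frames of 80-dim features
    labels = torch.randint(0, 1300, (2, 12))   # 12 previously emitted units per utterance
    print(model(feats, labels).shape)          # torch.Size([2, 100, 12, 1301])
```

The paper's central question is which output unit such a transducer should predict. The snippet below is a hand-written illustration of how one Mandarin utterance decomposes under the three candidate units compared in the abstract; the example phrase, its pinyin, and the initial/final split are chosen here for illustration and are not taken from the paper's data or lexicon.

```python
# Utterance: 今天天气很好 ("The weather is nice today") -- illustrative example only.

# Chinese character units: one unit per character; largest inventory.
characters = ["今", "天", "天", "气", "很", "好"]

# Syllable-with-tone units: one pinyin syllable per character, tone appended.
syllables_with_tone = ["jin1", "tian1", "tian1", "qi4", "hen3", "hao3"]

# Syllable initial/final with tone: each syllable split into an initial
# (consonant onset) and a final (rhyme), with the tone carried by the final.
initials_finals_with_tone = ["j", "in1", "t", "ian1", "t", "ian1",
                             "q", "i4", "h", "en3", "h", "ao3"]

print("character units:           ", " ".join(characters))
print("syllable + tone units:     ", " ".join(syllables_with_tone))
print("initial/final + tone units:", " ".join(initials_finals_with_tone))
```

The unit choice mainly trades off inventory size against sequence length: syllables with tone keep the output vocabulary compact (roughly 1,300 tonal syllables versus several thousand commonly used characters) while staying close to pronunciation, which is consistent with the WER and CER gains reported above.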
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method that is specifically designed for knowledge distillation from a transformer based language model to a transformer based speech model.
We achieve an intent classification accuracy of 99.10% and 88.79% for Fluent speech corpus and ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- Multitask Learning and Joint Optimization for Transformer-RNN-Transducer Speech Recognition [13.198689566654107]
This paper explores multitask learning, joint optimization, and joint decoding methods for transformer-RNN-transducer systems.
We show that the proposed methods can reduce word error rate (WER) by 16.6 % and 13.3 % for test-clean and test-other datasets, respectively.
arXiv Detail & Related papers (2020-11-02T06:38:06Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and Convolutional Neural Network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test-clean/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation (a masking sketch is given after this list).
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
- EEG based Continuous Speech Recognition using Transformers [13.565270550358397]
We investigate continuous speech recognition from electroencephalography (EEG) features using an end-to-end transformer-based automatic speech recognition (ASR) model.
Our results demonstrate that the transformer-based model trains faster than recurrent neural network (RNN) based sequence-to-sequence EEG models.
arXiv Detail & Related papers (2019-12-31T08:36:59Z)
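The multi-speaker transformer entry above restricts self-attention to a segment rather than the whole sequence; one generic way to express such a restriction is a block-diagonal attention mask. The sketch below assumes fixed-length, non-overlapping segments and uses a hypothetical helper name (segment_attention_mask); it is not the authors' implementation.

```python
import torch

def segment_attention_mask(seq_len: int, segment_len: int) -> torch.Tensor:
    """Boolean mask where True marks key positions a query may NOT attend to.

    Each query frame may only attend to key frames inside its own fixed-length
    segment, which bounds the cost of self-attention on long utterances.
    """
    positions = torch.arange(seq_len)
    segment_ids = positions // segment_len                 # segment index per frame
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    return ~same_segment                                   # True = masked out

if __name__ == "__main__":
    mask = segment_attention_mask(seq_len=8, segment_len=4)
    attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
    x = torch.randn(1, 8, 16)                              # (batch, time, feature)
    out, _ = attn(x, x, x, attn_mask=mask)                 # attention stays within segments
    print(out.shape)                                       # torch.Size([1, 8, 16])
```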
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.