Syllable based DNN-HMM Cantonese Speech to Text System
- URL: http://arxiv.org/abs/2402.08788v1
- Date: Tue, 13 Feb 2024 20:54:24 GMT
- Title: Syllable based DNN-HMM Cantonese Speech to Text System
- Authors: Timothy Wong and Claire Li and Sam Lam and Billy Chiu and Qin Lu and
Minglei Li and Dan Xiong and Roy Shing Yu and Vincent T.Y. Ng
- Abstract summary: This paper builds up a Cantonese Speech-to-Text (STT) system with a syllable based acoustic model.
The ONC-based syllable acoustic modeling achieves the best performance with the word error rate (WER) of 9.66% and the real time factor (RTF) of 1.38812.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper reports our work on building up a Cantonese Speech-to-Text (STT)
system with a syllable based acoustic model. This is part of an effort to
build an STT system to aid dyslexic students who have cognitive deficiencies
in writing but no difficulty expressing their ideas through speech. For
Cantonese speech recognition, the basic unit of acoustic models can either be
the conventional Initial-Final (IF) syllables, or the Onset-Nucleus-Coda (ONC)
syllables where finals are further split into nucleus and coda to reflect the
intra-syllable variations in Cantonese. Using the Kaldi toolkit, our hybrid
Deep Neural Network and Hidden Markov Model (DNN-HMM) system is trained with
stochastic gradient descent on GPUs, with and without the i-vector based
speaker adaptive training technique.
In all cases, the DNN takes the same input features as the Gaussian Mixture
Model with speaker adaptive training (GMM-SAT). Experiments show that the
ONC-based syllable acoustic modeling with I-vector based DNN-HMM achieves the
best performance with the word error rate (WER) of 9.66% and the real time
factor (RTF) of 1.38812.
Related papers
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition [38.60303603000269]
We propose to use a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems.
The proposed encoding is a combination of pronunciation-based syllables and character index (CI).
arXiv Detail & Related papers (2022-07-29T09:49:10Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech and better similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art TDNN-based ECAPA-TDNN, a FastSpeech2 based synthesizer, and a HiFi-GAN vocoder.
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
- Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model [34.061441900912136]
We argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
arXiv Detail & Related papers (2021-10-31T09:28:04Z)
- DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and the testing conditions due to the noise, etc.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
arXiv Detail & Related papers (2020-11-02T13:50:59Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose AutoSpeech, the first neural architecture search approach for speaker recognition tasks.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high word error rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.