Syllable based DNN-HMM Cantonese Speech to Text System
- URL: http://arxiv.org/abs/2402.08788v1
- Date: Tue, 13 Feb 2024 20:54:24 GMT
- Title: Syllable based DNN-HMM Cantonese Speech to Text System
- Authors: Timothy Wong and Claire Li and Sam Lam and Billy Chiu and Qin Lu and
Minglei Li and Dan Xiong and Roy Shing Yu and Vincent T.Y. Ng
- Abstract summary: This paper builds up a Cantonese Speech-to-Text (STT) system with a syllable based acoustic model.
The ONC-based syllable acoustic modeling achieves the best performance, with a word error rate (WER) of 9.66% and a real time factor (RTF) of 1.38812.
- Score: 3.976127530758402
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper reports our work on building a Cantonese Speech-to-Text
(STT) system with a syllable based acoustic model. This is part of an effort to
build an STT system that aids dyslexic students who have cognitive deficiencies
in writing skills but have no problem expressing their ideas through speech.
For Cantonese speech recognition, the basic unit of acoustic models can be
either the conventional Initial-Final (IF) syllables or the Onset-Nucleus-Coda
(ONC) syllables, where finals are further split into nucleus and coda to
reflect the intra-syllable variations in Cantonese. Using the Kaldi toolkit,
our hybrid Deep Neural Network and Hidden Markov Model (DNN-HMM) system is
trained with stochastic gradient descent optimization on GPUs, with and without
the i-vector based speaker adaptive training technique. In all cases, the DNN
input features are those of the same Gaussian Mixture Model trained with
speaker adaptive training (GMM-SAT). Experiments show that ONC-based syllable
acoustic modeling with the i-vector based DNN-HMM achieves the best
performance, with a word error rate (WER) of 9.66% and a real time factor
(RTF) of 1.38812.
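As a rough illustration of the two unit inventories compared above, the sketch below splits toneless Jyutping syllables into Initial-Final (IF) and Onset-Nucleus-Coda (ONC) units. The onset and nucleus lists follow standard Jyutping; the paper's actual lexicon and unit definitions are not reproduced here, so treat this as a simplified approximation rather than the authors' implementation.

```python
# Simplified sketch: split a toneless Jyutping syllable into the two kinds of
# acoustic-model units discussed above. The inventories are standard Jyutping;
# the splits used in the paper's lexicon may differ.

ONSETS = ["gw", "kw", "ng", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "w", "z", "c", "s", "j"]           # longest first
NUCLEI = ["aa", "oe", "eo", "yu", "a", "e", "i", "o", "u"]  # longest first
CODAS = {"p", "t", "k", "m", "n", "ng", "i", "u", ""}       # "" = open syllable

def split_if(syllable):
    """Initial-Final split, e.g. 'gwong' -> ('gw', 'ong')."""
    if syllable in ("m", "ng"):              # syllabic nasals have no initial
        return "", syllable
    for onset in ONSETS:
        if syllable.startswith(onset) and len(syllable) > len(onset):
            return onset, syllable[len(onset):]
    return "", syllable                      # null-initial syllables, e.g. 'aap'

def split_onc(syllable):
    """Onset-Nucleus-Coda split, e.g. 'gwong' -> ('gw', 'o', 'ng')."""
    onset, final = split_if(syllable)
    for nucleus in NUCLEI:
        if final.startswith(nucleus) and final[len(nucleus):] in CODAS:
            return onset, nucleus, final[len(nucleus):]
    return onset, final, ""                  # e.g. syllabic 'ng' or 'm'

if __name__ == "__main__":
    for syl in ["gwong", "zoeng", "sik", "aap"]:
        print(syl, "IF:", split_if(syl), "ONC:", split_onc(syl))
```

For instance, 'gwong' keeps the single final 'ong' under IF modeling but splits into nucleus 'o' and coda 'ng' under ONC modeling, which is the intra-syllable structure the ONC units are intended to expose to the acoustic model.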
Related papers
- Quartered Spectral Envelope and 1D-CNN-based Classification of Normally Phonated and Whispered Speech [0.0]
The presence of pitch and pitch harmonics in normal speech, and its absence in whispered speech, is evident in the spectral envelope of the Fourier transform.
We propose the use of one dimensional convolutional neural networks (1D-CNN) to capture these features.
The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset.
arXiv Detail & Related papers (2024-08-25T07:17:11Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS).
Specifically, we train a neural language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio model.
Vall-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition [38.60303603000269]
We propose to use a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems.
The proposed encoding is a combination of pronunciation-based syllables and a character index (CI).
arXiv Detail & Related papers (2022-07-29T09:49:10Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech and better similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art TDNN-based ECAPA-TDNN, a FastSpeech2 based synthesizer, and a HiFi-GAN vocoder.
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.