PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeling Unit Training for Robust Uyghur E2E Speech Recognition
- URL: http://arxiv.org/abs/2112.06721v1
- Date: Mon, 13 Dec 2021 15:04:33 GMT
- Title: PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeling Unit Training for Robust Uyghur E2E Speech Recognition
- Authors: Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang
- Abstract summary: Consonant and vowel reduction might cause performance degradation in Uyghur automatic speech recognition.
We propose a multi-modeling unit training (MMUT) architecture fused with PMT (PM-MMUT) to boost the performance of PMT.
Experimental results on Uyghur ASR show that the proposed approaches yield significant improvements, outperforming pure PMT.
- Score: 5.412341237841356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consonant and vowel reduction are often encountered in Uyghur speech, which
might cause performance degradation in Uyghur automatic speech recognition (ASR).
Our recently proposed masking-based learning strategy, Phone Masking Training (PMT),
alleviates the impact of this phenomenon in Uyghur ASR. Although PMT achieves
remarkable improvements, there is still room for further gains due to the granularity
mismatch between the masking unit of PMT (phoneme) and the modeling unit (word-piece).
To boost the performance of PMT, we propose a multi-modeling unit training (MMUT)
architecture fused with PMT (PM-MMUT). The idea of the MMUT framework is to split the
encoder into two parts: acoustic feature sequence to phoneme-level representation
(AF-to-PLR) and phoneme-level representation to word-piece-level representation
(PLR-to-WPLR). This allows AF-to-PLR to be optimized by an intermediate phoneme-based
CTC loss, so it learns the rich phoneme-level context information brought by PMT.
Experimental results on Uyghur ASR show that the proposed approaches yield significant
improvements, outperforming pure PMT (reducing WER from 24.0 to 23.7 on Read-Test and
from 38.4 to 36.8 on Oral-Test, respectively). We also conduct experiments on the
960-hour LibriSpeech benchmark using ESPnet1, achieving about 10% relative WER
reduction on all test sets without LM fusion, compared with the latest official
ESPnet1 pre-trained model.
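The architecture described in the abstract lends itself to a short sketch. The PyTorch code below is a minimal, hypothetical rendering of the PM-MMUT idea, not the authors' implementation: a phone_mask helper stands in for PMT-style augmentation, and the encoder is split into an AF-to-PLR block trained with an intermediate phoneme-level CTC loss and a PLR-to-WPLR block trained with the final word-piece CTC loss. Layer counts, dimensions, the masking probability, and the loss weight alpha are illustrative assumptions.

```python
# Minimal sketch of PM-MMUT (assumptions marked in comments): split encoder with
# an intermediate phoneme CTC loss (AF-to-PLR) and a final word-piece CTC loss
# (PLR-to-WPLR). Not the paper's exact configuration.
import torch
import torch.nn as nn


def phone_mask(feats, phone_segments, p=0.1):
    """PMT-style augmentation (sketch): zero the frames of randomly chosen
    phoneme segments. `phone_segments` is a list of (start, end) frame spans,
    e.g. from a forced alignment; the masking probability `p` is illustrative."""
    for start, end in phone_segments:
        if torch.rand(1).item() < p:
            feats[:, start:end, :] = 0.0
    return feats


class PMMMUTEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_phones=64, n_wordpieces=1000,
                 lower_layers=6, upper_layers=6, alpha=0.3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # AF-to-PLR: acoustic feature sequence -> phoneme-level representation
        self.af_to_plr = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=1024,
                                       batch_first=True),
            num_layers=lower_layers)
        # PLR-to-WPLR: phoneme-level -> word-piece-level representation
        self.plr_to_wplr = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=1024,
                                       batch_first=True),
            num_layers=upper_layers)
        self.phone_head = nn.Linear(d_model, n_phones)   # intermediate CTC head
        self.wp_head = nn.Linear(d_model, n_wordpieces)  # final CTC head
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.alpha = alpha  # interpolation weight between the two losses (assumed)

    def forward(self, feats, feat_lens, phone_tgt, phone_lens, wp_tgt, wp_lens):
        x = self.proj(feats)              # (batch, time, d_model)
        plr = self.af_to_plr(x)           # phoneme-level representation
        wplr = self.plr_to_wplr(plr)      # word-piece-level representation
        # nn.CTCLoss expects (time, batch, classes) log-probabilities
        phone_logp = self.phone_head(plr).log_softmax(-1).transpose(0, 1)
        wp_logp = self.wp_head(wplr).log_softmax(-1).transpose(0, 1)
        loss_phone = self.ctc(phone_logp, phone_tgt, feat_lens, phone_lens)
        loss_wp = self.ctc(wp_logp, wp_tgt, feat_lens, wp_lens)
        # Intermediate phoneme CTC regularizes AF-to-PLR; weighting is assumed.
        return self.alpha * loss_phone + (1.0 - self.alpha) * loss_wp
```

The point of the intermediate phoneme CTC is that it gives AF-to-PLR an explicit phoneme-level target, so the phoneme-granular masking of PMT is matched by a phoneme-granular modeling unit inside the encoder rather than only by word-piece targets.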
Related papers
- TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with significantly reduced parameters and computational costs.
TIGER leverages prior knowledge to divide frequency bands and compresses frequency information.
We show that TIGER achieves performance surpassing the state-of-the-art (SOTA) model TF-GridNet.
arXiv Detail & Related papers (2024-10-02T12:21:06Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition [63.38229762589485]
We propose a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM.
With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
arXiv Detail & Related papers (2023-02-16T21:07:38Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows speech to be better recognized in the presence of environmental noise and significantly accelerates training, reaching lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems [17.160006765475988]
We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) model.
We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model.
This results in a single E2E model that can be used during inference to perform frame filtering at low cost.
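To make the "switch" connection concrete, here is a small hypothetical sketch; the module types, sizes, and the per-batch switch flag are assumptions rather than the paper's design. The endpointer (EP) consumes either the raw audio frames or a low-level latent representation from the ASR encoder, so a single E2E model serves both tasks.

```python
# Hypothetical sketch of a joint ASR + endpointer model with a "switch"
# connection on the EP input. Sizes and module choices are illustrative.
import torch
import torch.nn as nn


class JointASREP(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_tokens=1000):
        super().__init__()
        self.asr_lower = nn.LSTM(feat_dim, d_model, num_layers=2, batch_first=True)
        self.asr_upper = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        self.asr_head = nn.Linear(d_model, n_tokens)
        self.ep_in_from_audio = nn.Linear(feat_dim, d_model)
        self.ep = nn.LSTM(d_model, d_model, num_layers=1, batch_first=True)
        self.ep_head = nn.Linear(d_model, 2)  # e.g. speech vs. end-of-query

    def forward(self, feats, use_latent_for_ep: bool):
        low, _ = self.asr_lower(feats)        # low-level latent representation
        high, _ = self.asr_upper(low)
        asr_logits = self.asr_head(high)
        # "Switch" connection: EP input is either audio frames or the ASR latent.
        ep_input = low if use_latent_for_ep else self.ep_in_from_audio(feats)
        ep_out, _ = self.ep(ep_input)
        ep_logits = self.ep_head(ep_out)
        return asr_logits, ep_logits
```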
arXiv Detail & Related papers (2022-11-01T23:43:15Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best supervised approach reported, using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition [78.67749936030219]
Prune-Adjust-Re-Prune (PARP) discovers and fine-tunes subnetworks for much better ASR performance.
Experiments on low-resource English and multi-lingual ASR show that sparse subnetworks exist in pre-trained speech SSL models.
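As a rough illustration of the prune/adjust/re-prune loop, here is a sketch; the magnitude criterion, helper names, and schedule are assumptions, not the paper's exact recipe.

```python
# Sketch of a prune -> adjust (finetune) -> re-prune loop. `finetune_one_epoch`
# is a caller-supplied routine that trains with the masks applied while still
# letting pruned weights receive gradients (the "adjust" step).
import torch


def magnitude_mask(model, sparsity):
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:  # prune weight matrices only
            k = int(sparsity * p.numel())
            threshold = p.abs().flatten().kthvalue(k).values if k > 0 else -1.0
            masks[name] = (p.abs() > threshold).float()  # keep the largest weights
    return masks


def parp(model, finetune_one_epoch, sparsity=0.5, rounds=3):
    masks = magnitude_mask(model, sparsity)          # initial prune
    for _ in range(rounds):
        finetune_one_epoch(model, masks)             # adjust: pruned weights may revive
        masks = magnitude_mask(model, sparsity)      # re-prune to the target sparsity
    return masks
```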
arXiv Detail & Related papers (2021-06-10T17:32:25Z)
- Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone [43.77139614544301]
Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we extensively investigate a two-step approach where we first pre-train a serialized output training (SOT)-based multi-talker ASR model.
With fine-tuning on the 70 hours of the AMI-SDM training data, our SOT ASR model achieves a word error rate (WER) of 21.2% for the AMI-SDM evaluation set.
arXiv Detail & Related papers (2021-03-31T02:43:32Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
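As an illustration of what dynamically combining noisy and enhanced features can look like, the sketch below uses a simple sigmoid gate; the GRF in the paper is a gated recurrent formulation, so treat this as a simplified stand-in rather than the authors' method.

```python
# Simplified gated fusion of noisy and enhanced features (illustrative only).
import torch
import torch.nn as nn


class GatedFeatureFusion(nn.Module):
    def __init__(self, feat_dim=80):
        super().__init__()
        self.gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, noisy, enhanced):
        # noisy, enhanced: (batch, time, feat_dim)
        g = torch.sigmoid(self.gate(torch.cat([noisy, enhanced], dim=-1)))
        # Per-dimension gate decides how much to trust the enhanced features.
        return g * enhanced + (1.0 - g) * noisy
```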
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
- Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech [17.602098162338137]
We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
arXiv Detail & Related papers (2020-08-03T08:13:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.