Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing
- URL: http://arxiv.org/abs/2205.06963v1
- Date: Sat, 14 May 2022 04:26:13 GMT
- Title: Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing
- Authors: Heli Qi, Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura
- Abstract summary: We propose an improved consistency training paradigm of semi-supervised S2S ASR.
We utilize speech chain reconstruction as the weak augmentation to generate high-quality pseudo labels.
Our improved paradigm achieves a 12.2% CER improvement in the single-speaker setting and 38.6% in the multi-speaker setting.
- Score: 21.049557187137776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consistency regularization has recently been applied to semi-supervised
sequence-to-sequence (S2S) automatic speech recognition (ASR). This principle
encourages an ASR model to output similar predictions for the same input speech
with different perturbations. The existing paradigm of semi-supervised S2S ASR
utilizes SpecAugment as data augmentation and requires a static teacher model
to produce pseudo transcripts for untranscribed speech. However, this paradigm
fails to take full advantage of consistency regularization. First, the masking
operations of SpecAugment may damage the linguistic contents of the speech,
thus influencing the quality of pseudo labels. Second, S2S ASR requires both
input speech and prefix tokens to make the next prediction. The static prefix
tokens made by the offline teacher model cannot match dynamic pseudo labels
during consistency training. In this work, we propose an improved consistency
training paradigm of semi-supervised S2S ASR. We utilize speech chain
reconstruction as the weak augmentation to generate high-quality pseudo labels.
Moreover, we demonstrate that dynamic pseudo transcripts produced by the
student ASR model benefit the consistency training. Experiments on LJSpeech and
LibriSpeech corpora show that compared to supervised baselines, our improved
paradigm achieves a 12.2% CER improvement in the single-speaker setting and
38.6% in the multi-speaker setting.
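The recipe described in the abstract (pseudo labels from a weakly augmented view, a consistency loss on a strongly augmented view, and dynamic transcripts from the student itself as decoder prefixes) can be sketched roughly as below. The model and augmentation interfaces are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the consistency-training recipe above, assuming a generic
# seq2seq ASR model with greedy_decode() and a teacher-forced forward().
# Names (asr_model, weak_augment, spec_augment) are illustrative, not from the paper.

def consistency_loss(asr_model, speech, weak_augment, spec_augment):
    # 1) Weak augmentation (e.g. speech chain reconstruction: TTS resynthesis of
    #    the same utterance) keeps the linguistic content intact.
    with torch.no_grad():
        weak_speech = weak_augment(speech)
        # Dynamic pseudo transcript produced by the *current* student model,
        # so the prefix tokens match the pseudo labels during training.
        pseudo_tokens = asr_model.greedy_decode(weak_speech)          # (B, T)

    # 2) Strong augmentation (SpecAugment-style masking) perturbs the input.
    strong_speech = spec_augment(speech)

    # 3) Teacher-forced prediction on the strongly augmented input, supervised
    #    by the pseudo transcript obtained from the weakly augmented view.
    log_probs = asr_model(strong_speech, prefix_tokens=pseudo_tokens[:, :-1])  # (B, T-1, V)
    return F.nll_loss(log_probs.transpose(1, 2), pseudo_tokens[:, 1:])
```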
Related papers
- Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models.
Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations.
We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
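The FST machinery itself does not fit in a short snippet, but the core idea of mapping wordpieces straight to phonemes without assembling words first can be illustrated with a toy lookup over an n-best list standing in for the lattice; the wordpiece-to-phoneme table below is hypothetical.

```python
# Toy stand-in for wordpiece-level G2P: an n-best list replaces the lattice and a
# hand-made table replaces the learned FST (both are illustrative assumptions).
WORDPIECE_TO_PHONEMES = {
    "_c": ["K"], "_k": ["K"], "ath": ["AE", "TH"], "y": ["IY"],
}

def wordpieces_to_phonemes(wordpieces):
    phones = []
    for wp in wordpieces:
        phones.extend(WORDPIECE_TO_PHONEMES.get(wp, ["<unk>"]))
    return phones

# Two competing lattice paths ("cathy" vs "kathy") yield the same phoneme string,
# which is where a phoneme-level rewrite with contextually relevant entities can
# recover the correct spelling.
print(wordpieces_to_phonemes(["_c", "ath", "y"]))
print(wordpieces_to_phonemes(["_k", "ath", "y"]))
```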
arXiv Detail & Related papers (2024-09-24T21:42:25Z)
- Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP [18.90593650641679]
A two-stage automatic annotation pipeline is proposed in this paper.
In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation pairs to enhance prosodic information in latent representations.
In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier.
Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries, respectively.
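The first-stage contrastive pretraining pairs speech units with text units; a standard symmetric InfoNCE objective over such paired embeddings would look roughly like the sketch below. The embedding dimensions and temperature are assumptions, and the paper may use a different contrastive formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired speech/text embeddings.

    speech_emb, text_emb: (B, D) embeddings of aligned Speech-Silence and
    Word-Punctuation units (the pairing itself comes from alignment, which is
    outside this sketch).
    """
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```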
arXiv Detail & Related papers (2023-09-11T12:50:28Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
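The summary does not spell out the two codebook generation approaches; a common way to make masked-prediction targets supervision-guided is to cluster hidden features of an ASR model trained on labeled data and use the cluster IDs as targets. The sketch below follows that assumption; the encoder features and code count are placeholders.

```python
from sklearn.cluster import KMeans

# Hypothetical sketch: build masked-prediction targets from a codebook guided by
# supervision, i.e. k-means over encoder features of an ASR model fine-tuned on
# labeled speech (how those features are produced is assumed, not a real API).

def build_codebook(features, num_codes=500):
    # features: (N_frames, D) hidden states collected from the supervised encoder
    return KMeans(n_clusters=num_codes, n_init=10, random_state=0).fit(features)

def masked_prediction_targets(codebook, utterance_features):
    # Cluster IDs act as discrete pseudo-labels for the masked frames.
    return codebook.predict(utterance_features)   # (T,) codebook indices
```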
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
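The "repeatedly masks and predicts unit choices" step matches the generic mask-predict style of non-autoregressive decoding; the sketch below shows that generic recipe with an assumed model interface and mask token, not TranSpeech's exact decoder.

```python
import torch

MASK_ID = 0  # assumed mask token id in the discrete-unit vocabulary

@torch.no_grad()
def mask_predict(model, src_speech, tgt_len, iterations=4):
    """Generic mask-predict decoding over discrete units (illustrative only)."""
    units = torch.full((1, tgt_len), MASK_ID, dtype=torch.long)
    scores = torch.zeros(1, tgt_len)
    for it in range(iterations):
        logits = model(src_speech, units)            # (1, T, V), assumed signature
        probs, pred = logits.softmax(-1).max(-1)
        masked = units.eq(MASK_ID)
        units[masked] = pred[masked]                 # fill every masked slot
        scores[masked] = probs[masked]
        # Re-mask the least confident units for the next iteration.
        n_mask = int(tgt_len * (1 - (it + 1) / iterations))
        if n_mask == 0:
            break
        remask = scores.topk(n_mask, largest=False).indices
        units[0, remask] = MASK_ID
        scores[0, remask] = 0.0
    return units
```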
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
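The summary does not say how overlap becomes a single-label problem; one standard trick is power-set encoding, where every subset of active speakers maps to one class. A toy sketch under that assumption (not necessarily SEND's exact formulation):

```python
from itertools import combinations

def powerset_labels(max_speakers=3):
    """Map every subset of up to `max_speakers` speakers to one class id."""
    subsets = [()]
    for k in range(1, max_speakers + 1):
        subsets += list(combinations(range(max_speakers), k))
    return {s: i for i, s in enumerate(subsets)}

label_of = powerset_labels(3)
# A frame where speakers 0 and 2 talk simultaneously becomes a single class id
# instead of a multi-label target:
print(label_of[(0, 2)])
```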
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
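One simple reading of sequence-level self-learning with multiple hypotheses is to weight a decoder loss over the n-best pseudo transcripts by their normalized beam scores; the sketch below follows that assumption and is not necessarily the paper's exact MTL formulation.

```python
import torch
import torch.nn.functional as F

def n_best_self_learning_loss(log_probs_per_hyp, hyp_tokens, hyp_scores):
    """Weighted sum of losses over n-best pseudo transcripts (illustrative).

    log_probs_per_hyp: list of (T_i, V) decoder log-probabilities, one per hypothesis
    hyp_tokens:        list of (T_i,)  pseudo-label token ids from beam search
    hyp_scores:        length-N beam scores, softmax-normalized into weights
    """
    weights = torch.softmax(torch.as_tensor(hyp_scores, dtype=torch.float), dim=0)
    losses = torch.stack([F.nll_loss(lp, y)
                          for lp, y in zip(log_probs_per_hyp, hyp_tokens)])
    return (weights * losses).sum()
```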
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
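A minimal version of attention-based fusion is to encode every n-best hypothesis and let the summarizer's decoder query attend over the concatenated token encodings; the dimensions and module below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

d_model, n_hyp, hyp_len = 256, 4, 50

# Token encodings of each n-best ASR hypothesis (e.g. from a shared text encoder).
hyp_encodings = torch.randn(n_hyp * hyp_len, 1, d_model)   # (S, batch, D)
decoder_state = torch.randn(1, 1, d_model)                 # current summary decoder query

fusion_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4)
fused, weights = fusion_attn(query=decoder_state,
                             key=hyp_encodings,
                             value=hyp_encodings)
# `fused` mixes information from all hypotheses, so content that appears in several
# hypotheses can outweigh isolated ASR errors in any single transcript.
```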
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting [54.03356526990088]
We propose Sequence Span Rewriting (SSR) as a self-supervised sequence-to-sequence (seq2seq) pre-training objective.
SSR provides more fine-grained learning signals for text representations by supervising the model to rewrite imperfect spans to ground truth.
Our experiments with T5 models on various seq2seq tasks show that SSR can substantially improve seq2seq pre-training.
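SSR turns an imperfect span filler's output into supervised rewriting pairs; a toy data-construction sketch follows (the span picker, the sentinel tokens, and the imperfect filler are stand-ins, not the paper's setup).

```python
import random

def make_ssr_example(tokens, imperfect_filler, span_len=3):
    """Build one (corrupted, target) pair for Sequence Span Rewriting.

    tokens:            ground-truth token list
    imperfect_filler:  any weaker model that proposes a (possibly wrong) span
    """
    start = random.randrange(0, max(1, len(tokens) - span_len))
    gold_span = tokens[start:start + span_len]
    # The imperfect model fills the masked span; its mistakes become the signal.
    noisy_span = imperfect_filler(tokens[:start], tokens[start + span_len:])
    source = (tokens[:start] + ["<span>"] + noisy_span + ["</span>"]
              + tokens[start + span_len:])
    target = gold_span   # the seq2seq model learns to rewrite the span to ground truth
    return source, target
```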
arXiv Detail & Related papers (2021-01-02T10:27:11Z)
- Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition [25.93405777713522]
We investigate whether it is possible to employ the original architecture of attention-based ASR for incremental speech recognition (ISR) tasks.
We design an alternative student network that, instead of using a thinner or a shallower model, keeps the original architecture of the teacher model but with shorter sequences.
Our experiments show that, by delaying the start of the recognition process by about 1.7 seconds, we can achieve performance comparable to a system that waits until the end of the utterance.
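Attention transfer here means the student, which sees shorter input segments, is trained to mimic the teacher's attention behaviour; below is a minimal sketch of such a distillation term with assumed tensor shapes rather than the paper's exact alignment scheme.

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attn, teacher_attn):
    """MSE between student and teacher attention maps on the shared frames.

    student_attn: (B, T_out, T_short)  student attention over its short segment
    teacher_attn: (B, T_out, T_full)   teacher attention over the full utterance
    """
    t_short = student_attn.size(-1)
    # Compare only the frames the incremental student has already seen,
    # renormalising the truncated teacher attention so rows still sum to 1.
    teacher_trunc = teacher_attn[..., :t_short]
    teacher_trunc = teacher_trunc / teacher_trunc.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return F.mse_loss(student_attn, teacher_trunc)
```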
arXiv Detail & Related papers (2020-11-04T05:06:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.