SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech
Recognition
- URL: http://arxiv.org/abs/2110.04187v1
- Date: Fri, 8 Oct 2021 15:15:38 GMT
- Title: SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech
Recognition
- Authors: Li Fu, Xiaoxiao Li, Runyu Wang, Zhengchen Zhang, Youzheng Wu, Xiaodong
He, Bowen Zhou
- Abstract summary: This paper proposes a novel framework of Supervised Contrastive Learning (SCaLa) to enhance phonemic information learning for end-to-end ASR systems.
Specifically, we introduce the self-supervised Masked Contrastive Predictive Coding (MCPC) into the fully-supervised setting.
To supervise phoneme learning explicitly, SCaLa first masks the variable-length encoder features corresponding to phonemes given phoneme forced-alignment extracted from a pre-trained acoustic model, and then predicts the masked phonemes via contrastive learning.
- Score: 36.766303689895686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end Automatic Speech Recognition (ASR) models are usually trained to
reduce the losses of the whole token sequences, while neglecting explicit
phonemic-granularity supervision. This could lead to recognition errors due to
similar-phoneme confusion or phoneme reduction. To alleviate this problem, this
paper proposes a novel framework of Supervised Contrastive Learning (SCaLa) to
enhance phonemic information learning for end-to-end ASR systems. Specifically,
we introduce the self-supervised Masked Contrastive Predictive Coding (MCPC)
into the fully-supervised setting. To supervise phoneme learning explicitly,
SCaLa first masks the variable-length encoder features corresponding to
phonemes given phoneme forced-alignment extracted from a pre-trained acoustic
model, and then predicts the masked phonemes via contrastive learning. The
phoneme forced-alignment can mitigate the noise of positive-negative pairs in
self-supervised MCPC. Experiments conducted on read and spontaneous speech
datasets show that the proposed approach achieves 2.84% and 1.38% Character
Error Rate (CER) reductions, respectively, compared to the baseline.
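To make the training objective concrete, below is a minimal sketch of a
span-level masked contrastive (InfoNCE) loss of the kind the abstract
describes. The function name, the mean-pooling of spans, and the use of
unmasked features as contrastive targets are illustrative assumptions, not
the authors' released implementation.
```python
import torch
import torch.nn.functional as F

def scala_span_loss(masked_enc, target_feats, spans, temperature=0.1):
    """Span-level InfoNCE loss sketching SCaLa-style masked phoneme
    prediction (hypothetical signature, not the paper's code).

    masked_enc:   (T, D) encoder output computed after the sampled
                  phoneme spans were masked in the input features.
    target_feats: (T, D) features of the same utterance without
                  masking, used here as contrastive targets.
    spans:        list of (start, end) frame ranges of the masked
                  phonemes, taken from the forced alignment.
    """
    # Pool each masked span into one query, and pool the matching
    # unmasked span into its positive target.
    queries = F.normalize(
        torch.stack([masked_enc[s:e].mean(dim=0) for s, e in spans]), dim=-1)
    targets = F.normalize(
        torch.stack([target_feats[s:e].mean(dim=0) for s, e in spans]), dim=-1)

    # Each query's matching span is its positive; the other phoneme
    # spans serve as negatives. Because spans come from the forced
    # alignment, positives and negatives are separated at phoneme
    # boundaries, mitigating the positive-negative pair noise the
    # abstract attributes to purely self-supervised MCPC.
    logits = queries @ targets.t() / temperature
    labels = torch.arange(len(spans))
    return F.cross_entropy(logits, labels)
```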
Related papers
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision
through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- End-to-end spoken language understanding using joint CTC loss and
self-supervised, pretrained acoustic encoders [13.722028186368737]
We leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification to extract textual embeddings.
Our model achieves a 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset (see the CTC fine-tuning sketch after this list).
arXiv Detail & Related papers (2023-05-04T15:36:37Z)
- Controllable speech synthesis by learning discrete phoneme-level
prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process that discretizes phoneme-level F0 and duration features from a multispeaker speech dataset (see the clustering sketch after this list).
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Multistream neural architectures for cued-speech recognition using a
pre-trained visual feature extractor and constrained CTC decoding [0.0]
Cued Speech (CS) is a visual communication tool that helps people with hearing impairments understand spoken language.
The proposed approach is based on a pre-trained hand and lips tracker used for visual feature extraction and a phonetic decoder based on a multistream recurrent neural network.
With a decoding accuracy at the phonetic level of 70.88%, the proposed system outperforms our previous CNN-HMM decoder and competes with more complex baselines.
arXiv Detail & Related papers (2022-04-11T09:30:08Z)
- Prosodic Clustering for Phoneme-level Prosody Control in End-to-End
Speech Synthesis [49.6007376399981]
We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.
The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration.
By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker (the note mapping is illustrated in the sketch after this list).
arXiv Detail & Related papers (2021-11-19T12:10:16Z)
- Hallucination of speech recognition errors with sequence to sequence
learning [16.39332236910586]
When plain-text data is used to train systems for spoken language understanding or ASR, a proven strategy is to hallucinate what the ASR output would be given a gold transcription.
We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence.
This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task.
arXiv Detail & Related papers (2021-03-23T02:09:39Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
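For the joint-CTC spoken language understanding entry above, here is a
minimal sketch of CTC fine-tuning on top of a pretrained acoustic encoder
using PyTorch's built-in CTCLoss. The dimensions, the linear projection,
and the toy tensors are placeholders, not the paper's actual model.
```python
import torch
import torch.nn as nn

# Placeholder sizes; the vocabulary includes the CTC blank at index 0.
vocab_size, hidden = 32, 768
proj = nn.Linear(hidden, vocab_size)  # frame-level token classifier
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(encoder_out, enc_lens, targets, target_lens):
    """encoder_out: (T, B, hidden) frames from the pretrained encoder."""
    log_probs = proj(encoder_out).log_softmax(dim=-1)  # (T, B, vocab)
    return ctc(log_probs, targets, enc_lens, target_lens)

# Toy usage with random tensors standing in for real features/labels.
T, B, U = 100, 2, 12
loss = ctc_step(torch.randn(T, B, hidden),
                torch.full((B,), T, dtype=torch.long),
                torch.randint(1, vocab_size, (B, U)),
                torch.full((B,), U, dtype=torch.long))
```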
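The two prosody-control entries above both rest on quantizing phoneme-level
prosodic features into a small set of discrete labels. Below is a sketch
under stated assumptions: scikit-learn's KMeans over hypothetical
per-phoneme mean F0 values, plus the standard F0-to-MIDI conversion
(69 + 12*log2(f/440)) for the musical-note variant.
```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-phoneme mean F0 values (Hz) pooled over a corpus.
phoneme_f0 = np.random.uniform(80, 300, size=(5000, 1))

# Unsupervised clustering discretizes F0 into a handful of prosodic
# labels, as in the prosodic-clustering entries above (sketch only).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(phoneme_f0)
labels = kmeans.predict(phoneme_f0)          # discrete label per phoneme
centroids = kmeans.cluster_centers_.ravel()  # Hz value behind each label

def nearest_midi_note(f0_hz):
    """Map an F0 centroid to the closest MIDI note number
    (A4 = 440 Hz = note 69), as in the musical-note variant."""
    return int(round(69 + 12 * np.log2(f0_hz / 440.0)))

note_centroids = [nearest_midi_note(c) for c in centroids]
```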
This list is automatically generated from the titles and abstracts of the papers on this site.